[2025-11-13 08:04:09,151][mllm.models.large_language_model_local][INFO] - Initializing adapter 'agent_adapter': no initial weights provided or found; starting from scratch.
[2025-11-13 08:04:09,975][mllm.models.adapter_training_wrapper][INFO] - Adapter 'agent_adapter': initialized with fresh weights (no initial weights found).
[2025-11-13 08:04:09,982][mllm.models.large_language_model_local][INFO] - Initializing adapter 'critic_adapter': no initial weights provided or found; starting from scratch.
[2025-11-13 08:04:11,042][mllm.models.adapter_training_wrapper][INFO] - Adapter 'critic_adapter': initialized with fresh weights (no initial weights found).
[2025-11-13 08:06:20,332][__main__][INFO] - Starting iteration 0.
[2025-11-13 08:06:20,336][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:06:20,337][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:06:25,014][__main__][INFO] - Number of regex retries in iteration 0: 0
[2025-11-13 08:06:25,014][__main__][INFO] - agents played in iteration 0 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:06:25,474][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.45%, Block Peak % of device VRAM: 18.68%, ΔTime: 00:00:00
[2025-11-13 08:06:25,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.45%, Block Peak % of device VRAM: 18.68%, ΔTime: 00:00:00
[2025-11-13 08:06:25,561][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.45%, Block Peak % of device VRAM: 18.68%, ΔTime: 00:00:00
[2025-11-13 08:06:25,600][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.45%, Block Peak % of device VRAM: 18.68%, ΔTime: 00:00:00
[2025-11-13 08:06:25,601][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:06:25,601][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:06:26,229][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:06:26,885][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:06:27,223][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:06:27,547][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:06:27,872][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:06:28,197][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:06:28,523][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:06:28,851][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:06:29,183][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:06:29,511][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:06:29,838][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:06:30,163][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:06:30,487][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:06:30,814][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:06:31,142][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:06:31,470][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:06:31,799][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:06:32,126][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:06:32,450][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:06:32,778][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:06:33,103][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:06:33,437][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:06:33,770][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:06:34,097][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:06:34,425][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:06:34,751][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:06:35,076][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:06:35,399][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:06:35,726][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:06:36,050][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:06:36,376][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:06:36,703][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:06:37,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:06:37,724][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 4.58%, Current % of VRAM taken: 42.03%, Block Peak % of device VRAM: 25.21%, ΔTime: 00:00:11
[2025-11-13 08:06:38,371][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:06:38,372][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:06:38,374][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:06:39,499][__main__][INFO] - Iteration 1 took 19s (24.40% Gen, 69.72% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 55m 21s. Estimated total time: 15h 58m 11s. Time estimates for 10 more iterations: 3m 11s, 100 more iterations: 31m 56s, 500 more iterations: 2h 39m 41s.
[2025-11-13 08:06:39,502][__main__][INFO] - Starting iteration 1.
[2025-11-13 08:06:39,505][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:06:39,506][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:06:43,252][__main__][INFO] - Number of regex retries in iteration 1: 0
[2025-11-13 08:06:43,252][__main__][INFO] - agents played in iteration 1 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:06:43,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:06:43,712][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:06:43,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:06:43,792][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:06:43,793][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:06:43,793][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:06:44,484][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:06:44,782][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:06:45,109][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:06:45,435][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:06:45,761][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:06:46,088][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:06:46,413][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:06:46,739][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:06:47,066][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:06:47,392][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:06:47,718][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:06:48,043][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:06:48,367][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:06:48,694][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:06:49,019][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:06:49,345][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:06:49,671][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:06:50,000][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:06:50,326][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:06:50,651][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:06:50,980][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:06:51,305][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:06:51,631][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:06:51,957][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:06:52,284][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:06:52,610][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:06:52,936][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:06:53,261][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:06:53,587][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:06:53,913][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:06:54,238][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:06:54,565][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:06:54,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:06:55,565][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:06:56,267][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:06:56,269][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:06:56,271][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:06:57,523][__main__][INFO] - Iteration 2 took 18s (20.79% Gen, 72.25% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 57m 46s. Estimated total time: 15h 0m 55s. Time estimates for 10 more iterations: 3m 0s, 100 more iterations: 30m 1s, 500 more iterations: 2h 30m 9s.
[2025-11-13 08:06:57,525][__main__][INFO] - Starting iteration 2.
[2025-11-13 08:06:57,528][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:06:57,529][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:07:01,130][__main__][INFO] - Number of regex retries in iteration 2: 0
[2025-11-13 08:07:01,131][__main__][INFO] - agents played in iteration 2 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:07:01,571][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:01,612][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:01,651][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:01,690][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:01,691][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:07:01,691][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:07:02,375][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:07:02,671][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:07:03,003][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:07:03,329][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:07:03,655][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:07:03,980][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:07:04,307][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:07:04,633][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:07:04,965][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:07:05,292][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:07:05,619][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:07:05,946][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:07:06,275][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:07:06,602][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:07:06,930][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:07:07,260][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:07:07,589][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:07:07,921][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:07:08,248][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:07:08,574][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:07:08,903][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:07:09,236][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:07:09,567][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:07:09,896][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:07:10,227][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:07:10,556][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:07:10,891][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:07:11,218][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:07:11,545][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:07:11,878][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:07:12,208][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:07:12,543][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:07:12,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:07:13,544][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:07:14,296][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:07:14,297][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:07:14,299][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:07:15,304][__main__][INFO] - Iteration 3 took 17s (20.26% Gen, 74.07% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 45m 25s. Estimated total time: 14h 48m 51s. Time estimates for 10 more iterations: 2m 57s, 100 more iterations: 29m 37s, 500 more iterations: 2h 28m 8s.
[2025-11-13 08:07:15,307][__main__][INFO] - Starting iteration 3.
[2025-11-13 08:07:15,310][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:07:15,310][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:07:18,928][__main__][INFO] - Number of regex retries in iteration 3: 0
[2025-11-13 08:07:18,928][__main__][INFO] - agents played in iteration 3 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:07:19,354][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:19,394][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:19,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:19,472][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:19,473][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:07:19,473][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:07:20,160][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:07:20,457][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:07:20,784][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:07:21,110][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:07:21,437][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:07:21,764][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:07:22,090][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:07:22,415][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:07:22,741][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:07:23,067][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:07:23,392][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:07:23,718][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:07:24,044][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:07:24,369][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:07:24,695][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:07:25,021][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:07:25,347][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:07:25,673][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:07:25,998][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:07:26,325][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:07:26,657][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:07:26,984][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:07:27,309][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:07:27,634][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:07:27,960][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:07:28,285][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:07:28,613][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:07:28,940][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:07:29,266][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:07:29,593][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:07:29,919][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:07:30,245][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:07:30,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:07:31,268][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:07:32,010][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:07:32,012][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:07:32,014][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:07:33,034][__main__][INFO] - Iteration 4 took 17s (20.41% Gen, 73.82% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 42m 30s. Estimated total time: 14h 46m 14s. Time estimates for 10 more iterations: 2m 57s, 100 more iterations: 29m 32s, 500 more iterations: 2h 27m 42s.
[2025-11-13 08:07:33,036][__main__][INFO] - Starting iteration 4.
[2025-11-13 08:07:33,040][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:07:33,040][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:07:36,732][__main__][INFO] - Number of regex retries in iteration 4: 0
[2025-11-13 08:07:36,732][__main__][INFO] - agents played in iteration 4 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:07:37,161][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:37,200][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:37,239][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:37,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:37,279][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:07:37,279][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:07:37,971][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:07:38,269][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:07:38,598][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:07:38,924][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:07:39,250][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:07:39,578][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:07:39,904][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:07:40,232][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:07:40,560][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:07:40,886][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:07:41,215][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:07:41,541][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:07:41,867][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:07:42,194][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:07:42,524][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:07:42,850][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:07:43,176][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:07:43,502][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:07:43,828][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:07:44,155][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:07:44,481][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:07:44,806][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:07:45,131][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:07:45,459][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:07:45,785][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:07:46,111][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:07:46,436][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:07:46,764][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:07:47,093][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:07:47,422][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:07:47,748][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:07:48,075][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:07:48,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:07:49,077][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:07:49,819][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:07:49,821][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:07:49,823][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:07:50,826][__main__][INFO] - Iteration 5 took 17s (20.76% Gen, 73.60% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 45m 19s. Estimated total time: 14h 49m 20s. Time estimates for 10 more iterations: 2m 57s, 100 more iterations: 29m 38s, 500 more iterations: 2h 28m 13s.
[2025-11-13 08:07:50,828][__main__][INFO] - Starting iteration 5.
[2025-11-13 08:07:50,832][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:07:50,832][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:07:54,521][__main__][INFO] - Number of regex retries in iteration 5: 0
[2025-11-13 08:07:54,522][__main__][INFO] - agents played in iteration 5 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:07:54,942][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:54,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:55,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:55,059][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:55,060][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:07:55,060][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:07:55,751][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:07:56,049][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:07:56,379][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:07:56,704][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:07:57,030][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:07:57,356][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:07:57,682][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:07:58,008][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:07:58,334][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:07:58,661][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:07:58,990][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:07:59,320][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:07:59,648][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:07:59,974][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:08:00,304][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:08:00,630][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:08:00,959][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:08:01,286][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:08:01,611][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:08:01,938][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:08:02,266][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:08:02,592][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:08:02,918][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:08:03,246][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:08:03,572][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:08:03,900][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:08:04,229][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:08:04,559][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:08:04,889][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:08:05,214][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:08:05,541][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:08:05,871][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:08:06,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:08:06,878][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:08:07,617][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:08:07,618][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:08:07,620][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:08:08,591][__main__][INFO] - Iteration 6 took 17s (20.77% Gen, 73.75% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 43m 43s. Estimated total time: 14h 48m 2s. Time estimates for 10 more iterations: 2m 57s, 100 more iterations: 29m 36s, 500 more iterations: 2h 28m 0s.
[2025-11-13 08:08:08,594][__main__][INFO] - Starting iteration 6.
[2025-11-13 08:08:08,597][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:08:08,598][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:08:12,249][__main__][INFO] - Number of regex retries in iteration 6: 0
[2025-11-13 08:08:12,250][__main__][INFO] - agents played in iteration 6 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:08:12,675][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:12,715][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:12,754][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:12,793][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:12,793][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:08:12,794][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:08:13,491][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:08:13,790][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:08:14,120][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:08:14,447][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:08:14,775][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:08:15,101][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:08:15,427][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:08:15,754][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:08:16,082][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:08:16,408][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:08:16,734][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:08:17,062][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:08:17,388][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:08:17,715][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:08:18,040][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:08:18,370][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:08:18,696][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:08:19,024][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:08:19,352][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:08:19,682][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:08:20,009][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:08:20,335][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:08:20,664][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:08:20,991][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:08:21,317][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:08:21,647][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:08:21,979][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:08:22,308][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:08:22,634][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:08:22,961][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:08:23,287][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:08:23,613][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:08:23,939][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:08:24,638][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:08:25,391][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:08:25,393][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:08:25,395][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:08:26,410][__main__][INFO] - Iteration 7 took 17s (20.50% Gen, 73.79% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 46m 3s. Estimated total time: 14h 50m 40s. Time estimates for 10 more iterations: 2m 58s, 100 more iterations: 29m 41s, 500 more iterations: 2h 28m 26s.
[2025-11-13 08:08:26,412][__main__][INFO] - Starting iteration 7.
[2025-11-13 08:08:26,415][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:08:26,416][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:08:30,086][__main__][INFO] - Number of regex retries in iteration 7: 0
[2025-11-13 08:08:30,087][__main__][INFO] - agents played in iteration 7 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:08:30,511][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:30,550][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:30,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:30,629][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:30,629][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:08:30,630][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:08:31,327][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:08:31,625][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:08:31,954][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:08:32,285][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:08:32,611][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:08:32,944][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:08:33,271][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:08:33,598][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:08:33,927][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:08:34,263][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:08:34,592][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:08:34,926][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:08:35,262][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:08:35,593][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:08:35,922][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:08:36,251][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:08:36,582][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:08:36,910][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:08:37,236][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:08:37,564][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:08:37,889][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:08:38,215][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:08:38,542][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:08:38,867][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:08:39,193][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:08:39,520][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:08:39,846][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:08:40,180][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:08:40,509][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:08:40,838][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:08:41,166][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:08:41,491][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:08:41,818][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:08:42,499][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:08:43,230][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:08:43,231][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:08:43,233][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:08:44,219][__main__][INFO] - Iteration 8 took 17s (20.61% Gen, 73.83% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 45m 20s. Estimated total time: 14h 50m 15s. Time estimates for 10 more iterations: 2m 58s, 100 more iterations: 29m 40s, 500 more iterations: 2h 28m 22s.
[2025-11-13 08:08:44,222][__main__][INFO] - Starting iteration 8.
[2025-11-13 08:08:44,225][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:08:44,226][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:08:47,916][__main__][INFO] - Number of regex retries in iteration 8: 0
[2025-11-13 08:08:47,917][__main__][INFO] - agents played in iteration 8 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:08:48,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:48,386][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:48,425][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:48,464][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:48,465][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:08:48,465][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:08:49,162][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:08:49,460][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:08:49,793][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:08:50,119][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:08:50,448][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:08:50,777][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:08:51,104][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:08:51,431][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:08:51,757][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:08:52,084][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:08:52,413][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:08:52,750][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:08:53,078][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:08:53,406][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:08:53,733][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:08:54,059][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:08:54,389][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:08:54,717][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:08:55,047][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:08:55,373][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:08:55,701][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:08:56,028][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:08:56,355][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:08:56,683][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:08:57,011][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:08:57,337][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:08:57,666][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:08:57,992][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:08:58,319][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:08:58,645][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:08:58,976][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:08:59,310][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:08:59,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:09:00,322][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:09:01,064][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:09:01,065][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:09:01,067][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:09:02,141][__main__][INFO] - Iteration 9 took 17s (20.60% Gen, 73.40% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 50m 35s. Estimated total time: 14h 55m 48s. Time estimates for 10 more iterations: 2m 59s, 100 more iterations: 29m 51s, 500 more iterations: 2h 29m 18s.
[2025-11-13 08:09:02,143][__main__][INFO] - Starting iteration 9.
[2025-11-13 08:09:02,146][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:09:02,146][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:09:05,829][__main__][INFO] - Number of regex retries in iteration 9: 0
[2025-11-13 08:09:05,830][__main__][INFO] - agents played in iteration 9 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:09:06,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:06,305][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:06,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:06,385][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:06,386][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:09:06,386][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:09:07,083][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:09:07,380][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:09:07,715][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:09:08,048][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:09:08,377][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:09:08,707][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:09:09,042][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:09:09,376][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:09:09,705][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:09:10,033][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:09:10,359][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:09:10,688][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:09:11,014][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:09:11,344][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:09:11,675][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:09:12,004][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:09:12,334][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:09:12,665][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:09:12,994][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:09:13,320][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:09:13,650][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:09:13,981][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:09:14,313][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:09:14,641][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:09:14,969][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:09:15,300][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:09:15,627][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:09:15,953][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:09:16,283][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:09:16,612][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:09:16,940][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:09:17,268][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:09:17,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:09:18,282][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:09:19,006][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:09:19,007][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:09:19,009][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:09:20,055][__main__][INFO] - Iteration 10 took 17s (20.57% Gen, 73.58% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 49m 57s. Estimated total time: 14h 55m 28s. Time estimates for 10 more iterations: 2m 59s, 100 more iterations: 29m 50s, 500 more iterations: 2h 29m 14s.
[2025-11-13 08:09:20,057][__main__][INFO] - Starting iteration 10.
[2025-11-13 08:09:20,061][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:09:20,061][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:09:23,703][__main__][INFO] - Number of regex retries in iteration 10: 0
[2025-11-13 08:09:23,704][__main__][INFO] - agents played in iteration 10 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:09:24,131][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:24,172][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:24,211][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:24,251][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:24,251][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:09:24,252][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:09:24,953][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:09:25,252][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:09:25,584][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:09:25,912][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:09:26,245][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:09:26,573][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:09:26,899][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:09:27,229][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:09:27,564][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:09:27,895][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:09:28,222][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:09:28,555][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:09:28,883][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:09:29,211][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:09:29,541][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:09:29,876][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:09:30,210][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:09:30,536][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:09:30,868][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:09:31,197][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:09:31,523][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:09:31,852][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:09:32,187][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:09:32,512][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:09:32,840][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:09:33,173][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:09:33,504][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:09:33,831][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:09:34,165][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:09:34,492][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:09:34,820][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:09:35,150][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:09:35,479][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:09:36,223][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:09:36,969][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:09:36,970][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:09:36,972][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:09:39,073][__main__][INFO] - Iteration 11 took 19s (19.16% Gen, 69.78% Train). Generation: 3s, Training: 13s. Estimated remaining time: 15h 44m 50s. Estimated total time: 15h 50m 40s. Time estimates for 10 more iterations: 3m 10s, 100 more iterations: 31m 41s, 500 more iterations: 2h 38m 26s.
[2025-11-13 08:09:39,075][__main__][INFO] - Starting iteration 11.
[2025-11-13 08:09:39,079][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:09:39,079][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:09:43,285][__main__][INFO] - Number of regex retries in iteration 11: 0
[2025-11-13 08:09:43,286][__main__][INFO] - agents played in iteration 11 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:09:43,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:43,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:43,789][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:43,828][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:43,829][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:09:43,829][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:09:44,522][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:09:44,820][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:09:45,149][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:09:45,476][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:09:45,804][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:09:46,130][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:09:46,459][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:09:46,787][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:09:47,114][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:09:47,441][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:09:47,771][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:09:48,099][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:09:48,434][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:09:48,760][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:09:49,089][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:09:49,416][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:09:49,744][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:09:50,071][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:09:50,398][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:09:50,726][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:09:51,056][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:09:51,384][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:09:51,718][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:09:52,046][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:09:52,372][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:09:52,702][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:09:53,038][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:09:53,373][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:09:53,703][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:09:54,033][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:09:54,367][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:09:54,694][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:09:55,023][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:09:55,702][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:09:56,418][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:09:56,420][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:09:56,421][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:09:57,428][__main__][INFO] - Iteration 12 took 18s (22.92% Gen, 71.58% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 11m 23s. Estimated total time: 15h 17m 31s. Time estimates for 10 more iterations: 3m 3s, 100 more iterations: 30m 35s, 500 more iterations: 2h 32m 55s.
[2025-11-13 08:09:57,431][__main__][INFO] - Starting iteration 12.
[2025-11-13 08:09:57,434][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:09:57,434][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:10:01,395][__main__][INFO] - Number of regex retries in iteration 12: 0
[2025-11-13 08:10:01,395][__main__][INFO] - agents played in iteration 12 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:10:01,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:01,869][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:01,908][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:01,947][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:01,948][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:10:01,948][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:10:02,658][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:10:02,956][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:10:03,284][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:10:03,615][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:10:03,948][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:10:04,283][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:10:04,610][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:10:04,944][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:10:05,274][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:10:05,601][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:10:05,928][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:10:06,263][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:10:06,591][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:10:06,922][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:10:07,250][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:10:07,579][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:10:07,906][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:10:08,234][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:10:08,560][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:10:08,889][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:10:09,216][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:10:09,544][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:10:09,872][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:10:10,199][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:10:10,526][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:10:10,857][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:10:11,188][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:10:11,518][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:10:11,848][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:10:12,184][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:10:12,511][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:10:12,839][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:10:13,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:10:13,886][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:10:14,640][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:10:14,641][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:10:14,643][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:10:15,655][__main__][INFO] - Iteration 13 took 18s (21.73% Gen, 72.71% Train). Generation: 3s, Training: 13s. Estimated remaining time: 15h 4m 38s. Estimated total time: 15h 11m 5s. Time estimates for 10 more iterations: 3m 2s, 100 more iterations: 30m 22s, 500 more iterations: 2h 31m 50s.
[2025-11-13 08:10:15,658][__main__][INFO] - Starting iteration 13.
[2025-11-13 08:10:15,661][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:10:15,661][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:10:19,629][__main__][INFO] - Number of regex retries in iteration 13: 0
[2025-11-13 08:10:19,630][__main__][INFO] - agents played in iteration 13 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:10:20,065][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:20,106][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:20,146][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:20,186][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:20,186][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:10:20,187][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:10:20,896][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:10:21,194][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:10:21,522][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:10:21,851][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:10:22,178][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:10:22,505][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:10:22,833][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:10:23,160][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:10:23,489][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:10:23,815][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:10:24,144][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:10:24,472][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:10:24,799][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:10:25,127][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:10:25,455][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:10:25,783][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:10:26,111][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:10:26,438][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:10:26,767][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:10:27,095][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:10:27,423][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:10:27,752][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:10:28,081][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:10:28,410][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:10:28,737][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:10:29,069][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:10:29,396][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:10:29,725][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:10:30,052][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:10:30,381][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:10:30,709][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:10:31,037][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:10:31,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:10:32,063][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:10:32,785][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:10:32,787][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:10:32,788][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:10:33,762][__main__][INFO] - Iteration 14 took 18s (21.92% Gen, 72.69% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 58m 22s. Estimated total time: 15h 5m 6s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 10s, 500 more iterations: 2h 30m 51s.
[2025-11-13 08:10:33,764][__main__][INFO] - Starting iteration 14.
[2025-11-13 08:10:33,768][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:10:33,768][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:10:37,879][__main__][INFO] - Number of regex retries in iteration 14: 0
[2025-11-13 08:10:37,880][__main__][INFO] - agents played in iteration 14 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:10:38,323][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:38,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:38,414][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:38,459][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:38,460][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:10:38,460][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:10:39,178][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:10:39,476][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:10:39,805][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:10:40,131][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:10:40,459][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:10:40,787][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:10:41,114][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:10:41,441][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:10:41,769][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:10:42,098][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:10:42,427][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:10:42,755][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:10:43,083][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:10:43,412][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:10:43,740][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:10:44,069][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:10:44,399][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:10:44,727][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:10:45,057][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:10:45,388][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:10:45,717][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:10:46,045][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:10:46,372][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:10:46,708][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:10:47,034][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:10:47,364][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:10:47,692][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:10:48,019][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:10:48,346][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:10:48,674][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:10:49,002][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:10:49,329][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:10:49,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:10:50,349][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:10:51,076][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:10:51,078][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:10:51,080][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:10:52,115][__main__][INFO] - Iteration 15 took 18s (22.41% Gen, 71.94% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 10m 22s. Estimated total time: 15h 17m 25s. Time estimates for 10 more iterations: 3m 3s, 100 more iterations: 30m 34s, 500 more iterations: 2h 32m 54s.
[2025-11-13 08:10:52,117][__main__][INFO] - Starting iteration 15.
[2025-11-13 08:10:52,120][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:10:52,121][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:10:56,128][__main__][INFO] - Number of regex retries in iteration 15: 0
[2025-11-13 08:10:56,129][__main__][INFO] - agents played in iteration 15 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:10:56,559][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:56,601][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:56,641][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:56,681][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:56,682][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:10:56,682][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:10:57,406][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:10:57,706][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:10:58,036][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:10:58,368][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:10:58,694][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:10:59,024][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:10:59,351][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:10:59,678][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:11:00,007][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:11:00,334][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:11:00,663][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:11:00,989][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:11:01,319][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:11:01,646][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:11:01,973][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:11:02,300][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:11:02,629][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:11:02,957][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:11:03,285][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:11:03,612][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:11:03,942][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:11:04,270][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:11:04,597][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:11:04,926][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:11:05,254][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:11:05,581][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:11:05,908][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:11:06,243][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:11:06,573][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:11:06,900][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:11:07,228][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:11:07,557][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:11:07,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:11:08,590][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:11:09,319][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:11:09,320][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:11:09,322][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:11:10,296][__main__][INFO] - Iteration 16 took 18s (22.05% Gen, 72.58% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 1m 29s. Estimated total time: 15h 8m 50s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 17s, 500 more iterations: 2h 31m 28s.
[2025-11-13 08:11:10,300][__main__][INFO] - Starting iteration 16.
[2025-11-13 08:11:10,303][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:11:10,303][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:11:14,265][__main__][INFO] - Number of regex retries in iteration 16: 0 [2025-11-13 08:11:14,266][__main__][INFO] - agents played in iteration 16 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 08:11:14,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:14,740][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:14,781][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:14,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:14,822][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:11:14,822][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:11:15,538][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:11:15,836][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:11:16,163][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:11:16,490][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:11:16,820][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:11:17,148][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:11:17,476][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:11:17,805][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:11:18,133][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:11:18,464][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:11:18,793][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:11:19,119][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:11:19,446][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:11:19,773][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:11:20,103][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:11:20,432][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:11:20,761][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:11:21,089][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:11:21,424][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:11:21,756][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:11:22,090][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:11:22,424][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:11:22,752][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:11:23,083][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:11:23,416][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:11:23,747][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:11:24,077][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:11:24,406][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:11:24,738][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:11:25,067][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:11:25,401][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:11:25,728][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:11:26,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:11:26,756][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:11:27,487][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:11:27,488][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:11:27,490][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:11:28,473][__main__][INFO] - Iteration 17 took 18s (21.81% Gen, 72.77% Train). Generation: 3s, Training: 13s. Estimated remaining time: 15h 0m 54s. Estimated total time: 15h 8m 33s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 17s, 500 more iterations: 2h 31m 25s.
[2025-11-13 08:11:28,475][__main__][INFO] - Starting iteration 17.
[2025-11-13 08:11:28,479][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:11:28,479][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:11:32,616][__main__][INFO] - Number of regex retries in iteration 17: 0
[2025-11-13 08:11:32,617][__main__][INFO] - agents played in iteration 17 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:11:33,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:11:33,085][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:11:33,125][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:11:33,165][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:11:33,165][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:11:33,166][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:11:33,876][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:11:34,175][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:11:34,507][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:11:34,842][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:11:35,175][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:11:35,501][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:11:35,829][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:11:36,155][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:11:36,482][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:11:36,810][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:11:37,136][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:11:37,463][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:11:37,793][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:11:38,119][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:11:38,449][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:11:38,775][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:11:39,103][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:11:39,432][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:11:39,760][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:11:40,086][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:11:40,414][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:11:40,740][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:11:41,069][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:11:41,397][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:11:41,724][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:11:42,050][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:11:42,381][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:11:42,708][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:11:43,035][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:11:43,362][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:11:43,689][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:11:44,016][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:11:44,345][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:11:45,034][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:11:45,757][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:11:45,758][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:11:45,760][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:11:46,804][__main__][INFO] - Iteration 18 took 18s (22.58% Gen, 71.72% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 8m 22s. Estimated total time: 15h 16m 20s. Time estimates for 10 more iterations: 3m 3s, 100 more iterations: 30m 32s, 500 more iterations: 2h 32m 43s.
[2025-11-13 08:11:46,807][__main__][INFO] - Starting iteration 18.
[2025-11-13 08:11:46,811][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:11:46,811][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:11:50,785][__main__][INFO] - Number of regex retries in iteration 18: 0
[2025-11-13 08:11:50,785][__main__][INFO] - agents played in iteration 18 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:11:51,217][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:11:51,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:11:51,301][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:11:51,342][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:11:51,342][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:11:51,343][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:11:52,056][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:11:52,356][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:11:52,685][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:11:53,014][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:11:53,342][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:11:53,669][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:11:53,996][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:11:54,324][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:11:54,651][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:11:54,979][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:11:55,306][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:11:55,633][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:11:55,967][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:11:56,293][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:11:56,622][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:11:56,949][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:11:57,277][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:11:57,608][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:11:57,933][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:11:58,268][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:11:58,596][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:11:58,923][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:11:59,250][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:11:59,583][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:11:59,913][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:12:00,241][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:12:00,572][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:12:00,903][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:12:01,232][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:12:01,558][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:12:01,886][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:12:02,216][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:12:02,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:12:03,242][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:12:03,968][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:12:03,969][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:12:03,971][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:12:04,960][__main__][INFO] - Iteration 19 took 18s (21.89% Gen, 72.65% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 59m 16s. Estimated total time: 15h 7m 31s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 15s, 500 more iterations: 2h 31m 15s.
[2025-11-13 08:12:04,962][__main__][INFO] - Starting iteration 19.
[2025-11-13 08:12:04,966][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:12:04,966][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:12:08,922][__main__][INFO] - Number of regex retries in iteration 19: 0
[2025-11-13 08:12:08,923][__main__][INFO] - agents played in iteration 19 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:12:09,355][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:09,395][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:09,435][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:09,475][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:09,476][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:12:09,476][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:12:10,193][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:12:10,491][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:12:10,819][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:12:11,145][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:12:11,479][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:12:11,814][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:12:12,144][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:12:12,474][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:12:12,806][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:12:13,133][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:12:13,466][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:12:13,793][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:12:14,124][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:12:14,451][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:12:14,778][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:12:15,105][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:12:15,431][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:12:15,759][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:12:16,085][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:12:16,416][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:12:16,746][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:12:17,073][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:12:17,400][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:12:17,728][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:12:18,058][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:12:18,385][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:12:18,712][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:12:19,039][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:12:19,369][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:12:19,698][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:12:20,026][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:12:20,354][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:12:20,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:12:21,393][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:12:22,114][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:12:22,115][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:12:22,117][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:12:23,142][__main__][INFO] - Iteration 20 took 18s (21.76% Gen, 72.59% Train). Generation: 3s, Training: 13s. Estimated remaining time: 15h 0m 17s. Estimated total time: 15h 8m 51s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 17s, 500 more iterations: 2h 31m 28s.
[2025-11-13 08:12:23,144][__main__][INFO] - Starting iteration 20.
[2025-11-13 08:12:23,147][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:12:23,147][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:12:27,140][__main__][INFO] - Number of regex retries in iteration 20: 0
[2025-11-13 08:12:27,140][__main__][INFO] - agents played in iteration 20 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:12:27,574][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:27,614][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:27,655][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:27,696][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:27,696][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:12:27,697][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:12:28,426][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:12:28,725][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:12:29,053][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:12:29,380][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:12:29,706][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:12:30,033][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:12:30,361][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:12:30,688][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:12:31,014][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:12:31,341][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:12:31,667][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:12:31,997][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:12:32,324][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:12:32,652][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:12:32,979][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:12:33,306][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:12:33,632][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:12:33,959][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:12:34,290][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:12:34,618][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:12:34,946][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:12:35,272][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:12:35,602][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:12:35,929][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:12:36,256][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:12:36,582][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:12:36,909][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:12:37,238][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:12:37,565][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:12:37,893][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:12:38,221][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:12:38,549][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:12:38,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:12:39,586][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:12:40,326][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:12:40,328][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:12:40,330][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:12:42,232][__main__][INFO] - Iteration 21 took 19s (20.92% Gen, 69.11% Train). Generation: 3s, Training: 13s. Estimated remaining time: 15h 45m 23s. Estimated total time: 15h 54m 16s. Time estimates for 10 more iterations: 3m 10s, 100 more iterations: 31m 48s, 500 more iterations: 2h 39m 2s.
[2025-11-13 08:12:42,234][__main__][INFO] - Starting iteration 21.
[2025-11-13 08:12:42,238][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:12:42,238][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:12:46,462][__main__][INFO] - Number of regex retries in iteration 21: 0
[2025-11-13 08:12:46,463][__main__][INFO] - agents played in iteration 21 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:12:46,896][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:46,936][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:46,977][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:47,018][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:47,018][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:12:47,019][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:12:47,731][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:12:48,030][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:12:48,358][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:12:48,684][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:12:49,011][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:12:49,339][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:12:49,669][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:12:49,995][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:12:50,322][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:12:50,651][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:12:50,979][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:12:51,309][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:12:51,638][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:12:51,964][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:12:52,290][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:12:52,617][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:12:52,944][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:12:53,272][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:12:53,598][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:12:53,925][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:12:54,252][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:12:54,578][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:12:54,904][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:12:55,238][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:12:55,567][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:12:55,897][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:12:56,226][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:12:56,552][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:12:56,880][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:12:57,206][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:12:57,535][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:12:57,866][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:12:58,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:12:58,878][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:12:59,601][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:12:59,602][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:12:59,604][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:13:00,628][__main__][INFO] - Iteration 22 took 18s (22.97% Gen, 71.45% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 10m 21s. Estimated total time: 15h 19m 32s. Time estimates for 10 more iterations: 3m 3s, 100 more iterations: 30m 39s, 500 more iterations: 2h 33m 15s.
[2025-11-13 08:13:00,630][__main__][INFO] - Starting iteration 22.
[2025-11-13 08:13:00,634][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:13:00,635][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:13:04,567][__main__][INFO] - Number of regex retries in iteration 22: 0
[2025-11-13 08:13:04,567][__main__][INFO] - agents played in iteration 22 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:13:05,004][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:05,044][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:05,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:05,124][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:05,125][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:13:05,126][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:13:05,845][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:13:06,143][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:13:06,473][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:13:06,801][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:13:07,128][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:13:07,459][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:13:07,788][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:13:08,115][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:13:08,441][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:13:08,771][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:13:09,101][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:13:09,430][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:13:09,758][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:13:10,088][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:13:10,419][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:13:10,746][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:13:11,073][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:13:11,402][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:13:11,728][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:13:12,054][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:13:12,382][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:13:12,708][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:13:13,035][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:13:13,361][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:13:13,687][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:13:14,013][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:13:14,339][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:13:14,667][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:13:14,997][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:13:15,325][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:13:15,651][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:13:15,977][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:13:16,305][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:13:16,999][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:13:17,725][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:13:17,726][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:13:17,728][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:13:18,721][__main__][INFO] - Iteration 23 took 18s (21.74% Gen, 72.76% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 54m 54s. Estimated total time: 15h 4m 24s. Time estimates for 10 more iterations: 3m 0s, 100 more iterations: 30m 8s, 500 more iterations: 2h 30m 44s.
[2025-11-13 08:13:18,723][__main__][INFO] - Starting iteration 23.
[2025-11-13 08:13:18,727][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:13:18,727][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:13:22,711][__main__][INFO] - Number of regex retries in iteration 23: 0
[2025-11-13 08:13:22,711][__main__][INFO] - agents played in iteration 23 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:13:23,161][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:23,203][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:23,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:23,284][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:23,285][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:13:23,285][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:13:24,000][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:13:24,299][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:13:24,628][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:13:24,957][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:13:25,288][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:13:25,615][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:13:25,943][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:13:26,271][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:13:26,600][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:13:26,932][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:13:27,259][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:13:27,585][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:13:27,911][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:13:28,239][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:13:28,569][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:13:28,899][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:13:29,226][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:13:29,552][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:13:29,879][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:13:30,207][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:13:30,534][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:13:30,862][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:13:31,190][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:13:31,516][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:13:31,844][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:13:32,172][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:13:32,499][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:13:32,824][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:13:33,152][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:13:33,477][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:13:33,803][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:13:34,130][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:13:34,459][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:13:35,151][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:13:35,900][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:13:35,901][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:13:35,903][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:13:36,891][__main__][INFO] - Iteration 24 took 18s (21.93% Gen, 72.62% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 58m 28s. Estimated total time: 15h 8m 16s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 16s, 500 more iterations: 2h 31m 22s.
[2025-11-13 08:13:36,893][__main__][INFO] - Starting iteration 24.
[2025-11-13 08:13:36,897][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:13:36,897][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:13:40,986][__main__][INFO] - Number of regex retries in iteration 24: 0
[2025-11-13 08:13:40,987][__main__][INFO] - agents played in iteration 24 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:13:41,424][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:41,464][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:41,505][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:41,545][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:41,546][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:13:41,546][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:13:42,272][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:13:42,574][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:13:42,901][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:13:43,229][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:13:43,557][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:13:43,885][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:13:44,213][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:13:44,539][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:13:44,866][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:13:45,193][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:13:45,520][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:13:45,846][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:13:46,176][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:13:46,508][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:13:46,835][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:13:47,169][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:13:47,499][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:13:47,829][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:13:48,155][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:13:48,483][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:13:48,809][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:13:49,136][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:13:49,465][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:13:49,795][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:13:50,123][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:13:50,450][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:13:50,780][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:13:51,109][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:13:51,435][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:13:51,768][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:13:52,098][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:13:52,430][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:13:52,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:13:53,459][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:13:54,203][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:13:54,205][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:13:54,206][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:13:55,221][__main__][INFO] - Iteration 25 took 18s (22.32% Gen, 72.14% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 6m 10s. Estimated total time: 15h 16m 16s. Time estimates for 10 more iterations: 3m 3s, 100 more iterations: 30m 32s, 500 more iterations: 2h 32m 42s.
[2025-11-13 08:13:55,223][__main__][INFO] - Starting iteration 25.
[2025-11-13 08:13:55,227][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:13:55,227][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:13:59,300][__main__][INFO] - Number of regex retries in iteration 25: 0
[2025-11-13 08:13:59,301][__main__][INFO] - agents played in iteration 25 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:13:59,738][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:59,777][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:59,817][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:59,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:59,859][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:13:59,859][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:14:00,589][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:14:00,888][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:14:01,218][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:14:01,545][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:14:01,873][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:14:02,201][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:14:02,527][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:14:02,856][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:14:03,185][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:14:03,512][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:14:03,842][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:14:04,175][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:14:04,506][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:14:04,836][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:14:05,163][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:14:05,496][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:14:05,825][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:14:06,160][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:14:06,494][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:14:06,821][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:14:07,152][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:14:07,481][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:14:07,814][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:14:08,145][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:14:08,472][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:14:08,798][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:14:09,126][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:14:09,452][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:14:09,783][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:14:10,112][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:14:10,445][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:14:10,772][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:14:11,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:14:11,804][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:14:12,538][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:14:12,539][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:14:12,541][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:14:13,547][__main__][INFO] - Iteration 26 took 18s (22.23% Gen, 72.27% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 5m 37s. Estimated total time: 15h 16m 1s. Time estimates for 10 more iterations: 3m 3s, 100 more iterations: 30m 32s, 500 more iterations: 2h 32m 40s.
[2025-11-13 08:14:13,549][__main__][INFO] - Starting iteration 26.
[2025-11-13 08:14:13,551][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:14:13,552][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:14:17,587][__main__][INFO] - Number of regex retries in iteration 26: 0
[2025-11-13 08:14:17,588][__main__][INFO] - agents played in iteration 26 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:14:18,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:18,061][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:18,103][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:18,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:18,145][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:14:18,145][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:14:18,854][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:14:19,152][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:14:19,481][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:14:19,810][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:14:20,137][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:14:20,466][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:14:20,793][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:14:21,120][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:14:21,454][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:14:21,784][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:14:22,111][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:14:22,437][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:14:22,764][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:14:23,092][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:14:23,418][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:14:23,752][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:14:24,084][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:14:24,411][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:14:24,737][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:14:25,067][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:14:25,400][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:14:25,726][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:14:26,053][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:14:26,379][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:14:26,705][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:14:27,032][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:14:27,361][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:14:27,688][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:14:28,015][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:14:28,344][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:14:28,675][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:14:29,003][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:14:29,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:14:30,025][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:14:30,749][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:14:30,750][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:14:30,752][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:14:31,753][__main__][INFO] - Iteration 27 took 18s (22.17% Gen, 72.32% Train). Generation: 4s, Training: 13s. Estimated remaining time: 14h 59m 26s. Estimated total time: 15h 10m 9s. Time estimates for 10 more iterations: 3m 2s, 100 more iterations: 30m 20s, 500 more iterations: 2h 31m 41s.
[2025-11-13 08:14:31,756][__main__][INFO] - Starting iteration 27.
[2025-11-13 08:14:31,759][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:14:31,759][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:14:35,845][__main__][INFO] - Number of regex retries in iteration 27: 0
[2025-11-13 08:14:35,846][__main__][INFO] - agents played in iteration 27 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:14:36,286][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:36,326][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:36,367][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:36,408][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:36,409][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:14:36,409][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:14:37,133][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:14:37,431][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:14:37,763][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:14:38,092][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:14:38,422][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:14:38,752][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:14:39,078][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:14:39,407][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:14:39,738][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:14:40,064][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:14:40,397][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:14:40,728][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:14:41,055][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:14:41,384][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:14:41,717][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:14:42,046][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:14:42,375][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:14:42,704][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:14:43,040][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:14:43,373][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:14:43,707][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:14:44,034][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:14:44,361][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:14:44,689][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:14:45,015][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:14:45,343][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:14:45,674][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:14:46,000][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:14:46,329][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:14:46,658][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:14:46,991][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:14:47,323][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:14:47,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:14:48,368][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:14:49,089][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:14:49,090][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:14:49,092][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:14:50,066][__main__][INFO] - Iteration 28 took 18s (22.32% Gen, 72.35% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 4m 24s. Estimated total time: 15h 15m 25s. Time estimates for 10 more iterations: 3m 3s, 100 more iterations: 30m 30s, 500 more iterations: 2h 32m 34s.
[2025-11-13 08:14:50,068][__main__][INFO] - Starting iteration 28.
[2025-11-13 08:14:50,072][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:14:50,072][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:14:54,153][__main__][INFO] - Number of regex retries in iteration 28: 0
[2025-11-13 08:14:54,154][__main__][INFO] - agents played in iteration 28 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:14:54,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:54,635][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:54,675][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:54,716][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:54,717][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:14:54,717][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:14:55,443][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:14:55,741][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:14:56,069][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:14:56,404][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:14:56,730][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:14:57,056][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:14:57,386][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:14:57,721][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:14:58,047][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:14:58,374][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:14:58,702][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:14:59,030][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:14:59,356][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:14:59,685][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:15:00,012][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:15:00,339][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:15:00,669][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:15:01,000][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:15:01,327][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:15:01,660][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:15:01,991][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:15:02,320][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:15:02,649][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:15:02,984][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:15:03,317][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:15:03,647][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:15:03,974][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:15:04,302][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:15:04,633][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:15:04,960][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:15:05,287][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:15:05,614][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:15:05,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:15:06,647][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:15:07,391][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:15:07,393][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:15:07,395][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:15:08,446][__main__][INFO] - Iteration 29 took 18s (22.21% Gen, 72.06% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 7m 26s. Estimated total time: 15h 18m 45s. Time estimates for 10 more iterations: 3m 3s, 100 more iterations: 30m 37s, 500 more iterations: 2h 33m 7s.
[2025-11-13 08:15:08,448][__main__][INFO] - Starting iteration 29.
[2025-11-13 08:15:08,451][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:15:08,451][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:15:12,554][__main__][INFO] - Number of regex retries in iteration 29: 0
[2025-11-13 08:15:12,554][__main__][INFO] - agents played in iteration 29 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:15:12,989][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:13,030][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:13,071][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:13,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:13,113][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:15:13,113][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:15:13,835][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:15:14,133][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:15:14,466][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:15:14,800][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:15:15,130][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:15:15,459][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:15:15,791][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:15:16,119][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:15:16,449][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:15:16,779][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:15:17,106][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:15:17,435][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:15:17,763][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:15:18,092][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:15:18,418][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:15:18,749][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:15:19,076][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:15:19,404][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:15:19,731][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:15:20,062][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:15:20,397][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:15:20,723][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:15:21,052][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:15:21,380][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:15:21,709][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:15:22,044][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:15:22,375][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:15:22,707][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:15:23,033][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:15:23,360][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:15:23,687][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:15:24,015][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:15:24,345][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:15:25,055][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:15:25,802][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:15:25,804][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:15:25,806][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:15:26,763][__main__][INFO] - Iteration 30 took 18s (22.40% Gen, 72.36% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 4m 4s. Estimated total time: 15h 15m 41s. Time estimates for 10 more iterations: 3m 3s, 100 more iterations: 30m 31s, 500 more iterations: 2h 32m 36s.
[2025-11-13 08:15:26,765][__main__][INFO] - Starting iteration 30.
[2025-11-13 08:15:26,768][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:15:26,769][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:15:30,797][__main__][INFO] - Number of regex retries in iteration 30: 0
[2025-11-13 08:15:30,798][__main__][INFO] - agents played in iteration 30 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:15:31,245][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:31,286][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:31,327][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:31,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:31,369][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:15:31,369][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:15:32,088][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:15:32,386][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:15:32,714][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:15:33,045][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:15:33,373][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:15:33,700][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:15:34,030][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:15:34,360][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:15:34,687][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:15:35,014][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:15:35,344][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:15:35,671][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:15:35,999][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:15:36,327][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:15:36,656][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:15:36,983][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:15:37,310][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:15:37,638][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:15:37,966][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:15:38,295][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:15:38,621][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:15:38,951][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:15:39,278][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:15:39,605][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:15:39,934][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:15:40,262][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:15:40,591][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:15:40,921][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:15:41,247][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:15:41,574][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:15:41,903][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:15:42,229][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:15:42,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:15:43,264][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:15:43,994][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:15:43,996][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:15:43,997][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:15:45,951][__main__][INFO] - Iteration 31 took 19s (21.00% Gen, 68.81% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 47m 13s. Estimated total time: 15h 59m 10s. Time estimates for 10 more iterations: 3m 11s, 100 more iterations: 31m 58s, 500 more iterations: 2h 39m 51s.
[2025-11-13 08:15:45,953][__main__][INFO] - Starting iteration 31.
[2025-11-13 08:15:45,955][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:15:45,956][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:15:50,649][__main__][INFO] - Number of regex retries in iteration 31: 0
[2025-11-13 08:15:50,649][__main__][INFO] - agents played in iteration 31 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:15:51,092][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:51,133][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:51,173][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:51,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:51,214][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:15:51,214][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:15:51,940][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:15:52,237][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:15:52,568][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:15:52,894][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:15:53,220][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:15:53,546][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:15:53,874][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:15:54,203][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:15:54,531][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:15:54,857][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:15:55,185][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:15:55,512][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:15:55,839][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:15:56,167][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:15:56,496][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:15:56,824][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:15:57,152][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:15:57,478][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:15:57,806][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:15:58,133][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:15:58,460][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:15:58,786][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:15:59,113][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:15:59,440][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:15:59,768][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:16:00,095][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:16:00,423][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:16:00,754][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:16:01,081][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:16:01,407][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:16:01,735][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:16:02,063][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:16:02,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:16:03,109][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:16:03,848][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:16:03,849][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:16:03,851][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:16:04,850][__main__][INFO] - Iteration 32 took 18s (24.84% Gen, 69.87% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 32m 30s. Estimated total time: 15h 44m 46s. Time estimates for 10 more iterations: 3m 8s, 100 more iterations: 31m 29s, 500 more iterations: 2h 37m 27s.
[2025-11-13 08:16:04,852][__main__][INFO] - Starting iteration 32.
[2025-11-13 08:16:04,855][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:16:04,856][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:16:09,372][__main__][INFO] - Number of regex retries in iteration 32: 0
[2025-11-13 08:16:09,373][__main__][INFO] - agents played in iteration 32 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:16:09,807][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:09,847][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:09,887][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:09,928][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:09,929][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:16:09,929][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:16:10,653][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:16:10,951][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:16:11,280][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:16:11,612][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:16:11,946][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:16:12,277][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:16:12,613][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:16:12,943][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:16:13,270][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:16:13,597][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:16:13,927][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:16:14,256][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:16:14,587][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:16:14,922][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:16:15,253][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:16:15,580][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:16:15,908][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:16:16,243][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:16:16,577][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:16:16,903][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:16:17,230][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:16:17,559][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:16:17,890][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:16:18,222][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:16:18,553][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:16:18,881][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:16:19,208][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:16:19,535][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:16:19,863][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:16:20,191][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:16:20,518][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:16:20,845][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:16:21,173][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:16:21,875][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:16:22,605][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:16:22,606][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:16:22,608][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:16:23,578][__main__][INFO] - Iteration 33 took 18s (24.12% Gen, 70.69% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 23m 35s. Estimated total time: 15h 36m 10s. Time estimates for 10 more iterations: 3m 7s, 100 more iterations: 31m 12s, 500 more iterations: 2h 36m 1s.
[2025-11-13 08:16:23,580][__main__][INFO] - Starting iteration 33.
[2025-11-13 08:16:23,583][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:16:23,584][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:16:28,060][__main__][INFO] - Number of regex retries in iteration 33: 0
[2025-11-13 08:16:28,061][__main__][INFO] - agents played in iteration 33 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:16:28,496][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:28,537][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:28,578][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:28,618][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:28,619][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:16:28,619][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:16:29,341][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:16:29,639][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:16:29,968][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:16:30,295][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:16:30,622][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:16:30,949][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:16:31,275][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:16:31,603][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:16:31,930][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:16:32,257][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:16:32,584][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:16:32,911][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:16:33,237][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:16:33,568][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:16:33,895][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:16:34,223][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:16:34,551][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:16:34,878][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:16:35,205][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:16:35,531][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:16:35,859][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:16:36,186][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:16:36,512][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:16:36,842][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:16:37,170][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:16:37,498][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:16:37,825][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:16:38,151][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:16:38,479][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:16:38,806][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:16:39,134][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:16:39,461][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:16:39,792][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:16:40,498][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:16:41,231][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:16:41,233][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:16:41,234][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:16:42,297][__main__][INFO] - Iteration 34 took 18s (23.92% Gen, 70.39% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 22m 51s. Estimated total time: 15h 35m 44s. Time estimates for 10 more iterations: 3m 7s, 100 more iterations: 31m 11s, 500 more iterations: 2h 35m 57s.
[2025-11-13 08:16:42,299][__main__][INFO] - Starting iteration 34.
[2025-11-13 08:16:42,302][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:16:42,303][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:16:46,864][__main__][INFO] - Number of regex retries in iteration 34: 0
[2025-11-13 08:16:46,864][__main__][INFO] - agents played in iteration 34 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:16:47,301][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:47,342][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:47,382][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:47,423][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:47,424][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:16:47,424][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:16:48,147][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:16:48,445][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:16:48,774][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:16:49,101][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:16:49,428][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:16:49,754][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:16:50,085][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:16:50,413][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:16:50,740][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:16:51,069][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:16:51,397][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:16:51,729][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:16:52,056][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:16:52,385][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:16:52,713][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:16:53,041][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:16:53,375][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:16:53,705][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:16:54,032][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:16:54,360][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:16:54,693][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:16:55,019][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:16:55,347][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:16:55,675][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:16:56,001][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:16:56,332][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:16:56,660][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:16:56,988][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:16:57,315][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:16:57,642][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:16:57,970][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:16:58,297][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:16:58,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:16:59,336][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:17:00,078][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:17:00,079][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:17:00,081][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:17:01,108][__main__][INFO] - Iteration 35 took 18s (24.25% Gen, 70.28% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 27m 8s. Estimated total time: 15h 40m 20s. Time estimates for 10 more iterations: 3m 8s, 100 more iterations: 31m 20s, 500 more iterations: 2h 36m 43s.
[2025-11-13 08:17:01,112][__main__][INFO] - Starting iteration 35.
[2025-11-13 08:17:01,115][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:17:01,116][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:17:05,651][__main__][INFO] - Number of regex retries in iteration 35: 0
[2025-11-13 08:17:05,651][__main__][INFO] - agents played in iteration 35 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:17:06,086][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:06,128][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:06,168][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:06,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:06,210][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:17:06,210][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:17:06,931][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:17:07,230][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:17:07,559][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:17:07,888][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:17:08,217][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:17:08,550][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:17:08,879][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:17:09,209][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:17:09,535][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:17:09,862][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:17:10,192][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:17:10,519][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:17:10,847][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:17:11,174][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:17:11,503][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:17:11,832][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:17:12,160][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:17:12,490][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:17:12,820][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:17:13,147][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:17:13,474][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:17:13,802][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:17:14,129][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:17:14,457][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:17:14,785][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:17:15,111][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:17:15,445][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:17:15,771][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:17:16,098][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:17:16,427][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:17:16,762][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:17:17,089][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:17:17,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:17:18,123][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:17:18,851][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:17:18,853][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:17:18,854][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:17:19,826][__main__][INFO] - Iteration 36 took 18s (24.23% Gen, 70.56% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 22m 5s. Estimated total time: 15h 35m 35s. Time estimates for 10 more iterations: 3m 7s, 100 more iterations: 31m 11s, 500 more iterations: 2h 35m 55s.
[2025-11-13 08:17:19,828][__main__][INFO] - Starting iteration 36.
[2025-11-13 08:17:19,831][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:17:19,832][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:17:21,268][mllm.models.large_language_model_local][WARNING] - Response did not match regex: (|), retry 1/1
[2025-11-13 08:17:26,115][__main__][INFO] - Number of regex retries in iteration 36: 1
[2025-11-13 08:17:26,116][__main__][INFO] - agents played in iteration 36 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:17:26,546][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:26,579][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:26,612][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:26,645][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:26,645][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:17:26,646][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:17:27,313][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:17:27,609][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:17:27,936][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:17:28,260][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:17:28,586][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:17:28,911][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:17:29,241][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:17:29,568][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:17:29,895][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:17:30,219][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:17:30,543][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:17:30,870][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:17:31,193][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:17:31,519][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:17:31,846][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:17:32,175][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:17:32,501][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:17:32,829][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:17:33,155][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:17:33,482][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:17:33,810][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:17:34,137][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:17:34,467][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:17:34,793][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:17:35,118][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:17:35,444][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:17:35,770][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:17:36,104][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:17:36,431][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:17:36,759][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:17:37,086][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:17:37,416][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:17:37,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:17:38,446][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:17:39,115][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:17:39,116][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:17:39,118][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:17:40,045][__main__][INFO] - Iteration 37 took 20s (31.09% Gen, 64.32% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 36m 54s. Estimated total time: 16h 50m 45s. Time estimates for 10 more iterations: 3m 22s, 100 more iterations: 33m 41s, 500 more iterations: 2h 48m 27s.
[2025-11-13 08:17:40,047][__main__][INFO] - Starting iteration 37.
[2025-11-13 08:17:40,050][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:17:40,051][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:17:44,477][__main__][INFO] - Number of regex retries in iteration 37: 0
[2025-11-13 08:17:44,478][__main__][INFO] - agents played in iteration 37 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:17:44,903][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:44,936][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:44,969][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:45,001][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:45,002][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:17:45,002][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:17:45,657][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:17:45,953][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:17:46,282][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:17:46,606][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:17:46,930][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:17:47,254][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:17:47,577][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:17:47,902][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:17:48,226][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:17:48,551][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:17:48,876][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:17:49,203][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:17:49,527][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:17:49,853][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:17:50,177][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:17:50,501][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:17:50,825][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:17:51,149][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:17:51,476][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:17:51,800][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:17:52,124][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:17:52,448][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:17:52,774][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:17:53,099][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:17:53,426][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:17:53,749][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:17:54,073][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:17:54,398][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:17:54,723][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:17:55,050][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:17:55,377][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:17:55,703][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:17:56,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:17:56,744][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:17:57,449][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:17:57,450][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:17:57,451][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:17:58,335][__main__][INFO] - Iteration 38 took 18s (24.21% Gen, 70.95% Train). Generation: 4s, Training: 12s. Estimated remaining time: 15h 0m 7s. Estimated total time: 15h 14m 17s. Time estimates for 10 more iterations: 3m 2s, 100 more iterations: 30m 28s, 500 more iterations: 2h 32m 22s.
[2025-11-13 08:17:58,337][__main__][INFO] - Starting iteration 38.
[2025-11-13 08:17:58,340][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:17:58,340][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:18:02,750][__main__][INFO] - Number of regex retries in iteration 38: 0 [2025-11-13 08:18:02,751][__main__][INFO] - agents played in iteration 38 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 08:18:03,198][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:03,231][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:03,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:03,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:03,297][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:18:03,298][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:18:03,951][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:18:04,247][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:18:04,574][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:18:04,899][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:18:05,223][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:18:05,547][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:18:05,871][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:18:06,195][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:18:06,522][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:18:06,846][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:18:07,172][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:18:07,497][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:18:07,821][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:18:08,148][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:18:08,472][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:18:08,795][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:18:09,122][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:18:09,451][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:18:09,775][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:18:10,099][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:18:10,423][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:18:10,748][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:18:11,072][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:18:11,398][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:18:11,723][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:18:12,048][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:18:12,372][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:18:12,697][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:18:13,022][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:18:13,347][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:18:13,672][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:18:13,999][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:18:14,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:18:15,027][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:18:15,747][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:18:15,748][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:18:15,750][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:18:16,641][__main__][INFO] - Iteration 39 took 18s (24.10% Gen, 71.02% Train). Generation: 4s, Training: 12s. Estimated remaining time: 15h 0m 37s. Estimated total time: 15h 15m 5s. Time estimates for 10 more iterations: 3m 3s, 100 more iterations: 30m 30s, 500 more iterations: 2h 32m 30s.
[2025-11-13 08:18:16,643][__main__][INFO] - Starting iteration 39.
[2025-11-13 08:18:16,645][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:18:16,645][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:18:21,080][__main__][INFO] - Number of regex retries in iteration 39: 0
[2025-11-13 08:18:21,080][__main__][INFO] - agents played in iteration 39 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:18:21,546][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:21,580][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:21,614][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:21,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:21,648][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:18:21,649][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:18:22,309][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:18:22,604][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:18:22,932][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:18:23,259][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:18:23,583][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:18:23,908][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:18:24,233][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:18:24,559][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:18:24,883][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:18:25,209][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:18:25,533][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:18:25,857][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:18:26,183][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:18:26,507][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:18:26,831][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:18:27,159][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:18:27,485][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:18:27,809][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:18:28,132][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:18:28,457][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:18:28,782][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:18:29,107][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:18:29,433][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:18:29,756][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:18:30,081][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:18:30,405][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:18:30,730][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:18:31,054][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:18:31,378][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:18:31,703][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:18:32,027][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:18:32,354][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:18:32,681][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:18:33,383][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:18:34,101][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:18:34,103][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:18:34,105][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:18:35,027][__main__][INFO] - Iteration 40 took 18s (24.13% Gen, 70.85% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 4m 24s. Estimated total time: 15h 19m 9s. Time estimates for 10 more iterations: 3m 3s, 100 more iterations: 30m 38s, 500 more iterations: 2h 33m 11s.
[2025-11-13 08:18:35,029][__main__][INFO] - Starting iteration 40.
[2025-11-13 08:18:35,032][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:18:35,032][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:18:39,559][__main__][INFO] - Number of regex retries in iteration 40: 0
[2025-11-13 08:18:39,559][__main__][INFO] - agents played in iteration 40 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:18:40,013][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:40,049][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:40,083][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:40,117][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:40,118][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:18:40,118][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:18:40,796][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:18:41,093][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:18:41,419][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:18:41,743][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:18:42,068][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:18:42,393][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:18:42,723][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:18:43,050][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:18:43,376][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:18:43,702][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:18:44,026][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:18:44,351][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:18:44,678][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:18:45,005][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:18:45,332][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:18:45,658][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:18:45,983][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:18:46,307][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:18:46,633][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:18:46,958][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:18:47,284][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:18:47,608][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:18:47,933][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:18:48,256][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:18:48,583][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:18:48,907][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:18:49,233][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:18:49,559][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:18:49,883][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:18:50,210][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:18:50,538][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:18:50,868][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:18:51,198][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:18:51,904][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:18:52,630][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:18:52,632][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:18:52,634][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:18:54,630][__main__][INFO] - Iteration 41 took 19s (23.10% Gen, 66.71% Train). Generation: 4s, Training: 13s. Estimated remaining time: 16h 4m 51s. Estimated total time: 16h 19m 56s. Time estimates for 10 more iterations: 3m 15s, 100 more iterations: 32m 39s, 500 more iterations: 2h 43m 19s.
[2025-11-13 08:18:54,632][__main__][INFO] - Starting iteration 41.
[2025-11-13 08:18:54,635][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:18:54,635][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:18:59,841][__main__][INFO] - Number of regex retries in iteration 41: 0
[2025-11-13 08:18:59,842][__main__][INFO] - agents played in iteration 41 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:19:00,291][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:00,325][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:00,359][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:00,394][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:00,395][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:19:00,395][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:19:01,056][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:19:01,352][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:19:01,677][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:19:02,003][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:19:02,327][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:19:02,651][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:19:02,978][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:19:03,302][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:19:03,628][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:19:03,954][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:19:04,277][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:19:04,605][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:19:04,929][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:19:05,255][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:19:05,581][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:19:05,905][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:19:06,231][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:19:06,556][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:19:06,882][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:19:07,206][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:19:07,530][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:19:07,854][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:19:08,179][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:19:08,503][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:19:08,828][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:19:09,152][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:19:09,476][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:19:09,802][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:19:10,125][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:19:10,449][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:19:10,774][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:19:11,099][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:19:11,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:19:12,132][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:19:12,849][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:19:12,851][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:19:12,853][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:19:13,775][__main__][INFO] - Iteration 42 took 19s (27.20% Gen, 67.97% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 41m 38s. Estimated total time: 15h 57m 3s. Time estimates for 10 more iterations: 3m 11s, 100 more iterations: 31m 54s, 500 more iterations: 2h 39m 30s.
[2025-11-13 08:19:13,777][__main__][INFO] - Starting iteration 42.
[2025-11-13 08:19:13,779][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:19:13,780][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:19:18,753][__main__][INFO] - Number of regex retries in iteration 42: 0
[2025-11-13 08:19:18,753][__main__][INFO] - agents played in iteration 42 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:19:19,203][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:19,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:19,271][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:19,305][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:19,306][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:19:19,307][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:19:19,987][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:19:20,283][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:19:20,608][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:19:20,936][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:19:21,263][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:19:21,589][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:19:21,913][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:19:22,237][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:19:22,561][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:19:22,885][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:19:23,209][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:19:23,532][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:19:23,857][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:19:24,183][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:19:24,506][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:19:24,831][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:19:25,155][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:19:25,480][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:19:25,804][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:19:26,128][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:19:26,454][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:19:26,778][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:19:27,105][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:19:27,429][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:19:27,754][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:19:28,080][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:19:28,404][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:19:28,730][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:19:29,057][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:19:29,384][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:19:29,715][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:19:30,043][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:19:30,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:19:31,078][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:19:31,795][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:19:31,796][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:19:31,798][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:19:32,741][__main__][INFO] - Iteration 43 took 18s (26.23% Gen, 68.79% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 32m 23s. Estimated total time: 15h 48m 6s. Time estimates for 10 more iterations: 3m 9s, 100 more iterations: 31m 36s, 500 more iterations: 2h 38m 1s.
[2025-11-13 08:19:32,743][__main__][INFO] - Starting iteration 43.
[2025-11-13 08:19:32,746][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:19:32,746][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:19:37,741][__main__][INFO] - Number of regex retries in iteration 43: 0
[2025-11-13 08:19:37,742][__main__][INFO] - agents played in iteration 43 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:19:38,190][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:38,223][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:38,258][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:38,292][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:38,293][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:19:38,293][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:19:38,996][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:19:39,291][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:19:39,617][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:19:39,944][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:19:40,271][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:19:40,596][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:19:40,923][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:19:41,250][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:19:41,576][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:19:41,903][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:19:42,227][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:19:42,552][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:19:42,879][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:19:43,204][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:19:43,530][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:19:43,855][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:19:44,180][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:19:44,505][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:19:44,829][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:19:45,154][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:19:45,479][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:19:45,802][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:19:46,126][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:19:46,450][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:19:46,775][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:19:47,099][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:19:47,424][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:19:47,750][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:19:48,073][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:19:48,397][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:19:48,724][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:19:49,047][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:19:49,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:19:50,079][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:19:50,800][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:19:50,801][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:19:50,803][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:19:51,886][__main__][INFO] - Iteration 44 took 19s (26.10% Gen, 68.24% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 41m 1s. Estimated total time: 15h 57m 3s. Time estimates for 10 more iterations: 3m 11s, 100 more iterations: 31m 54s, 500 more iterations: 2h 39m 30s. [2025-11-13 08:19:51,888][__main__][INFO] - Starting iteration 44. [2025-11-13 08:19:51,890][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2025-11-13 08:19:51,890][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:19:56,888][__main__][INFO] - Number of regex retries in iteration 44: 0
[2025-11-13 08:19:56,889][__main__][INFO] - agents played in iteration 44 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:19:57,342][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:57,376][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:57,411][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:57,445][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:57,446][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:19:57,446][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:19:58,150][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:19:58,446][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:19:58,771][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:19:59,097][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:19:59,427][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:19:59,755][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:20:00,081][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:20:00,410][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:20:00,735][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:20:01,062][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:20:01,387][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:20:01,715][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:20:02,039][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:20:02,362][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:20:02,689][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:20:03,013][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:20:03,339][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:20:03,664][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:20:03,990][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:20:04,316][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:20:04,642][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:20:04,969][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:20:05,296][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:20:05,619][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:20:05,947][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:20:06,271][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:20:06,594][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:20:06,919][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:20:07,244][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:20:07,570][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:20:07,894][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:20:08,219][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:20:08,545][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:20:09,252][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:20:09,989][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:20:09,991][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:20:09,993][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:20:10,995][__main__][INFO] - Iteration 45 took 19s (26.16% Gen, 68.58% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 38m 56s. Estimated total time: 15h 55m 17s. Time estimates for 10 more iterations: 3m 11s, 100 more iterations: 31m 50s, 500 more iterations: 2h 39m 12s.
[2025-11-13 08:20:10,997][__main__][INFO] - Starting iteration 45.
[2025-11-13 08:20:11,000][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:20:11,001][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:20:15,925][__main__][INFO] - Number of regex retries in iteration 45: 0
[2025-11-13 08:20:15,926][__main__][INFO] - agents played in iteration 45 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:20:16,375][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:16,408][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:16,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:16,475][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:16,475][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:20:16,476][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:20:17,177][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:20:17,472][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:20:17,799][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:20:18,123][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:20:18,447][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:20:18,771][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:20:19,097][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:20:19,421][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:20:19,746][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:20:20,070][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:20:20,395][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:20:20,720][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:20:21,046][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:20:21,374][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:20:21,699][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:20:22,022][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:20:22,347][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:20:22,672][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:20:22,996][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:20:23,324][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:20:23,648][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:20:23,973][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:20:24,297][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:20:24,621][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:20:24,946][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:20:25,269][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:20:25,599][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:20:25,923][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:20:26,251][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:20:26,575][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:20:26,900][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:20:27,225][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:20:27,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:20:28,242][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:20:28,957][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:20:28,959][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:20:28,960][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:20:29,915][__main__][INFO] - Iteration 46 took 18s (26.04% Gen, 68.91% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 29m 8s. Estimated total time: 15h 45m 48s. Time estimates for 10 more iterations: 3m 9s, 100 more iterations: 31m 31s, 500 more iterations: 2h 37m 38s.
[2025-11-13 08:20:29,918][__main__][INFO] - Starting iteration 46.
[2025-11-13 08:20:29,921][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:20:29,921][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:20:34,833][__main__][INFO] - Number of regex retries in iteration 46: 0
[2025-11-13 08:20:34,834][__main__][INFO] - agents played in iteration 46 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:20:35,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:35,325][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:35,359][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:35,393][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:35,393][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:20:35,394][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:20:36,107][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:20:36,402][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:20:36,731][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:20:37,056][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:20:37,380][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:20:37,706][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:20:38,034][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:20:38,358][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:20:38,684][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:20:39,008][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:20:39,332][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:20:39,656][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:20:39,981][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:20:40,307][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:20:40,631][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:20:40,955][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:20:41,279][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:20:41,604][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:20:41,927][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:20:42,252][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:20:42,578][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:20:42,902][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:20:43,226][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:20:43,551][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:20:43,879][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:20:44,204][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:20:44,530][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:20:44,854][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:20:45,179][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:20:45,503][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:20:45,829][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:20:46,153][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:20:46,478][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:20:47,182][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:20:47,912][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:20:47,914][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:20:47,919][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:20:49,037][__main__][INFO] - Iteration 47 took 19s (25.70% Gen, 68.44% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 38m 52s. Estimated total time: 15h 55m 52s. Time estimates for 10 more iterations: 3m 11s, 100 more iterations: 31m 51s, 500 more iterations: 2h 39m 18s.
[2025-11-13 08:20:49,040][__main__][INFO] - Starting iteration 47.
[2025-11-13 08:20:49,043][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:20:49,044][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:20:54,058][__main__][INFO] - Number of regex retries in iteration 47: 0
[2025-11-13 08:20:54,059][__main__][INFO] - agents played in iteration 47 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:20:54,509][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:54,543][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:54,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:54,610][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:54,611][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:20:54,611][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:20:55,326][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:20:55,621][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:20:55,948][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:20:56,273][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:20:56,600][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:20:56,926][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:20:57,252][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:20:57,576][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:20:57,900][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:20:58,226][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:20:58,552][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:20:58,877][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:20:59,200][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:20:59,526][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:20:59,849][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:21:00,174][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:21:00,499][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:21:00,823][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:21:01,148][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:21:01,474][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:21:01,798][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:21:02,125][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:21:02,449][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:21:02,779][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:21:03,103][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:21:03,427][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:21:03,752][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:21:04,077][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:21:04,401][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:21:04,726][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:21:05,050][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:21:05,375][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:21:05,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:21:06,418][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:21:07,166][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:21:07,168][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:21:07,170][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:21:08,153][__main__][INFO] - Iteration 48 took 19s (26.24% Gen, 68.60% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 38m 14s. Estimated total time: 15h 55m 33s. Time estimates for 10 more iterations: 3m 11s, 100 more iterations: 31m 51s, 500 more iterations: 2h 39m 15s.
[2025-11-13 08:21:08,156][__main__][INFO] - Starting iteration 48.
[2025-11-13 08:21:08,159][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:21:08,159][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:21:13,244][__main__][INFO] - Number of regex retries in iteration 48: 0
[2025-11-13 08:21:13,244][__main__][INFO] - agents played in iteration 48 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:21:13,693][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:21:13,726][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:21:13,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:21:13,794][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:21:13,795][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:21:13,795][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:21:14,498][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:21:14,795][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:21:15,121][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:21:15,446][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:21:15,770][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:21:16,097][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:21:16,420][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:21:16,745][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:21:17,072][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:21:17,397][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:21:17,722][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:21:18,047][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:21:18,373][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:21:18,698][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:21:19,023][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:21:19,347][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:21:19,671][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:21:19,995][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:21:20,322][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:21:20,653][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:21:20,984][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:21:21,310][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:21:21,640][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:21:21,967][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:21:22,295][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:21:22,621][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:21:22,950][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:21:23,273][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:21:23,597][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:21:23,924][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:21:24,247][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:21:24,572][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:21:24,896][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:21:25,595][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:21:26,329][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:21:26,330][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:21:26,332][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:21:27,375][__main__][INFO] - Iteration 49 took 19s (26.46% Gen, 68.11% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 43m 12s. Estimated total time: 16h 0m 51s. Time estimates for 10 more iterations: 3m 12s, 100 more iterations: 32m 1s, 500 more iterations: 2h 40m 8s.
[2025-11-13 08:21:27,377][__main__][INFO] - Starting iteration 49.
[2025-11-13 08:21:27,381][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
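The extrapolated horizons in the timing line above ("10 more iterations: 3m 12s, 100 more iterations: 32m 1s, 500 more iterations: 2h 40m 8s") are mutually consistent with a single average iteration duration of about 19.2 s multiplied by the horizon. A minimal sketch of that arithmetic, not taken from the actual codebase: `format_duration` and `avg_iter_s` are illustrative names, and the average is back-solved from the printed estimates, not read from the run.

```python
def format_duration(seconds: float) -> str:
    """Render a duration as 'Xh Ym Zs', dropping leading zero fields,
    in the same style as the log's time estimates (e.g. '2h 40m 8s')."""
    s = int(seconds)  # truncate fractional seconds
    h, s = divmod(s, 3600)
    m, s = divmod(s, 60)
    parts = []
    if h:
        parts.append(f"{h}h")
    if h or m:
        parts.append(f"{m}m")
    parts.append(f"{s}s")
    return " ".join(parts)

# Assumed average iteration duration, back-solved from the iteration-49
# printout (2h 40m 8s = 9608 s over 500 iterations -> 19.216 s/iter).
avg_iter_s = 19.216

for n in (10, 100, 500):
    print(f"{n} more iterations: {format_duration(n * avg_iter_s)}")
# → 10 more iterations: 3m 12s
# → 100 more iterations: 32m 1s
# → 500 more iterations: 2h 40m 8s
```

All three horizons reproduce the logged values to the second, which suggests the estimator is a plain linear extrapolation of mean iteration time rather than a windowed or decayed average.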
[2025-11-13 08:21:27,381][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:21:32,388][__main__][INFO] - Number of regex retries in iteration 49: 0 [2025-11-13 08:21:32,389][__main__][INFO] - agents played in iteration 49 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 08:21:32,841][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:32,878][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:32,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:32,946][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:32,946][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:21:32,947][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:21:33,661][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:21:33,957][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:21:34,282][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:21:34,609][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:21:34,936][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:21:35,263][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:21:35,589][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:21:35,915][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:21:36,241][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:21:36,566][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:21:36,892][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:21:37,216][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:21:37,544][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:21:37,870][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:21:38,195][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:21:38,519][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:21:38,849][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:21:39,177][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:21:39,502][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:21:39,827][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:21:40,154][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:21:40,481][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:21:40,806][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:21:41,135][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:21:41,461][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:21:41,791][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:21:42,119][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:21:42,446][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:21:42,772][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:21:43,096][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:21:43,421][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:21:43,745][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:21:44,072][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:21:44,787][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:21:45,523][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:21:45,525][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:21:45,527][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:21:46,499][__main__][INFO] - Iteration 50 took 19s (26.19% Gen, 68.72% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 38m 0s. Estimated total time: 15h 55m 57s. Time estimates for 10 more iterations: 3m 11s, 100 more iterations: 31m 51s, 500 more iterations: 2h 39m 19s. [2025-11-13 08:21:46,502][__main__][INFO] - Starting iteration 50. [2025-11-13 08:21:46,504][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2025-11-13 08:21:46,505][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:21:51,440][__main__][INFO] - Number of regex retries in iteration 50: 0 [2025-11-13 08:21:51,440][__main__][INFO] - agents played in iteration 50 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 08:21:51,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:51,925][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:51,960][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:51,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:51,996][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:21:51,996][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:21:52,711][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:21:53,006][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:21:53,331][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:21:53,657][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:21:53,988][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:21:54,318][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:21:54,645][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:21:54,971][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:21:55,297][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:21:55,620][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:21:55,946][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:21:56,274][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:21:56,598][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:21:56,929][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:21:57,254][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:21:57,579][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:21:57,903][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:21:58,228][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:21:58,554][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:21:58,879][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:21:59,205][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:21:59,533][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:21:59,859][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:22:00,183][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:22:00,508][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:22:00,832][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:22:01,160][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:22:01,484][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:22:01,811][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:22:02,137][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:22:02,467][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:22:02,793][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:22:03,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:22:03,839][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:22:04,581][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:22:04,582][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:22:04,584][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:22:06,615][__main__][INFO] - Iteration 51 took 20s (24.54% Gen, 65.36% Train). Generation: 4s, Training: 13s. Estimated remaining time: 16h 27m 17s. Estimated total time: 16h 45m 34s. Time estimates for 10 more iterations: 3m 21s, 100 more iterations: 33m 31s, 500 more iterations: 2h 47m 35s. [2025-11-13 08:22:06,617][__main__][INFO] - Starting iteration 51. [2025-11-13 08:22:06,620][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2025-11-13 08:22:06,620][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:22:12,008][__main__][INFO] - Number of regex retries in iteration 51: 0 [2025-11-13 08:22:12,008][__main__][INFO] - agents played in iteration 51 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 08:22:12,458][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:12,491][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:12,525][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:12,559][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:12,560][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:22:12,560][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:22:13,270][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:22:13,565][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:22:13,892][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:22:14,218][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:22:14,543][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:22:14,870][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:22:15,194][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:22:15,519][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:22:15,844][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:22:16,169][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:22:16,495][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:22:16,819][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:22:17,144][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:22:17,471][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:22:17,795][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:22:18,120][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:22:18,447][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:22:18,774][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:22:19,101][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:22:19,426][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:22:19,752][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:22:20,077][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:22:20,402][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:22:20,727][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:22:21,053][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:22:21,379][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:22:21,707][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:22:22,038][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:22:22,363][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:22:22,690][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:22:23,018][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:22:23,349][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:22:23,674][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:22:24,375][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:22:25,106][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:22:25,107][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:22:25,109][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:22:26,103][__main__][INFO] - Iteration 52 took 19s (27.65% Gen, 67.24% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 55m 34s. Estimated total time: 16h 14m 11s. Time estimates for 10 more iterations: 3m 14s, 100 more iterations: 32m 28s, 500 more iterations: 2h 42m 21s. [2025-11-13 08:22:26,105][__main__][INFO] - Starting iteration 52. [2025-11-13 08:22:26,108][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2025-11-13 08:22:26,109][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:22:31,428][__main__][INFO] - Number of regex retries in iteration 52: 0 [2025-11-13 08:22:31,429][__main__][INFO] - agents played in iteration 52 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 08:22:31,883][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:31,916][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:31,950][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:31,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:31,983][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:22:31,984][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:22:32,713][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:22:33,009][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:22:33,339][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:22:33,666][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:22:33,995][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:22:34,319][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:22:34,648][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:22:34,979][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:22:35,304][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:22:35,628][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:22:35,958][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:22:36,288][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:22:36,613][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:22:36,941][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:22:37,268][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:22:37,595][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:22:37,921][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:22:38,246][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:22:38,571][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:22:38,896][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:22:39,222][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:22:39,547][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:22:39,874][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:22:40,198][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:22:40,525][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:22:40,850][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:22:41,179][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:22:41,503][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:22:41,828][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:22:42,153][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:22:42,476][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:22:42,801][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:22:43,125][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:22:43,829][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:22:44,552][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:22:44,553][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:22:44,555][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:22:45,577][__main__][INFO] - Iteration 53 took 19s (27.32% Gen, 67.42% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 54m 33s. Estimated total time: 16h 13m 29s. Time estimates for 10 more iterations: 3m 14s, 100 more iterations: 32m 26s, 500 more iterations: 2h 42m 14s. [2025-11-13 08:22:45,580][__main__][INFO] - Starting iteration 53. [2025-11-13 08:22:45,583][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2025-11-13 08:22:45,584][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:22:50,908][__main__][INFO] - Number of regex retries in iteration 53: 0 [2025-11-13 08:22:50,908][__main__][INFO] - agents played in iteration 53 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 08:22:51,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:51,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:51,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:51,465][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:51,466][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:22:51,466][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:22:52,183][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:22:52,480][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:22:52,813][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:22:53,143][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:22:53,472][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:22:53,799][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:22:54,124][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:22:54,452][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:22:54,779][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:22:55,107][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:22:55,437][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:22:55,763][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:22:56,094][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:22:56,426][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:22:56,751][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:22:57,076][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:22:57,402][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:22:57,729][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:22:58,053][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:22:58,380][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:22:58,707][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:22:59,033][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:22:59,359][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:22:59,684][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:23:00,008][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:23:00,336][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:23:00,664][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:23:00,990][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:23:01,317][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:23:01,643][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:23:01,972][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:23:02,298][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:23:02,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:23:03,344][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:23:04,083][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:23:04,085][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:23:04,086][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:23:05,161][__main__][INFO] - Iteration 54 took 19s (27.19% Gen, 67.31% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 59m 39s. Estimated total time: 16h 18m 55s. Time estimates for 10 more iterations: 3m 15s, 100 more iterations: 32m 37s, 500 more iterations: 2h 43m 9s. [2025-11-13 08:23:05,163][__main__][INFO] - Starting iteration 54. [2025-11-13 08:23:05,166][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2025-11-13 08:23:05,167][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:23:10,459][__main__][INFO] - Number of regex retries in iteration 54: 0 [2025-11-13 08:23:10,459][__main__][INFO] - agents played in iteration 54 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 08:23:10,913][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:23:10,947][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:23:10,979][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:23:11,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:23:11,012][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:23:11,013][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:23:11,725][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:23:12,020][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:23:12,347][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:23:12,672][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:23:12,997][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:23:13,321][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:23:13,646][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:23:13,973][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:23:14,299][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:23:14,624][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:23:14,951][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:23:15,278][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:23:15,604][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:23:15,933][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:23:16,263][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:23:16,588][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:23:16,917][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:23:17,247][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:23:17,577][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:23:17,906][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:23:18,234][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:23:18,564][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:23:18,891][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:23:19,221][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:23:19,549][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:23:19,873][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:23:20,198][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:23:20,525][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:23:20,855][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:23:21,181][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:23:21,506][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:23:21,832][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:23:22,156][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:23:22,860][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:23:23,596][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:23:23,598][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:23:23,599][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:23:24,585][__main__][INFO] - Iteration 55 took 19s (27.25% Gen, 67.66% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 51m 24s. Estimated total time: 16h 10m 59s. Time estimates for 10 more iterations: 3m 14s, 100 more iterations: 32m 21s, 500 more iterations: 2h 41m 49s.
[2025-11-13 08:23:24,588][__main__][INFO] - Starting iteration 55.
[2025-11-13 08:23:24,591][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:23:24,592][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:23:29,908][__main__][INFO] - Number of regex retries in iteration 55: 0
[2025-11-13 08:23:29,908][__main__][INFO] - agents played in iteration 55 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:23:30,356][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:30,389][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:30,421][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:30,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:30,455][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:23:30,455][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:23:31,171][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:23:31,466][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:23:31,794][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:23:32,118][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:23:32,445][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:23:32,773][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:23:33,098][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:23:33,422][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:23:33,751][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:23:34,076][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:23:34,401][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:23:34,728][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:23:35,056][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:23:35,383][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:23:35,708][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:23:36,034][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:23:36,359][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:23:36,686][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:23:37,010][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:23:37,335][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:23:37,661][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:23:37,991][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:23:38,323][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:23:38,652][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:23:38,977][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:23:39,305][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:23:39,636][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:23:39,967][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:23:40,294][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:23:40,623][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:23:40,950][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:23:41,277][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:23:41,604][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:23:42,320][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:23:43,042][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:23:43,043][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:23:43,045][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:23:44,095][__main__][INFO] - Iteration 56 took 19s (27.25% Gen, 67.35% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 55m 19s. Estimated total time: 16h 15m 14s. Time estimates for 10 more iterations: 3m 15s, 100 more iterations: 32m 30s, 500 more iterations: 2h 42m 32s.
[2025-11-13 08:23:44,097][__main__][INFO] - Starting iteration 56.
[2025-11-13 08:23:44,100][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:23:44,100][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:23:49,316][__main__][INFO] - Number of regex retries in iteration 56: 0
[2025-11-13 08:23:49,317][__main__][INFO] - agents played in iteration 56 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:23:49,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:49,811][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:49,844][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:49,877][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:49,878][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:23:49,879][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:23:50,589][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:23:50,887][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:23:51,216][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:23:51,544][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:23:51,874][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:23:52,200][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:23:52,527][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:23:52,850][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:23:53,176][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:23:53,502][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:23:53,828][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:23:54,152][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:23:54,478][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:23:54,803][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:23:55,129][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:23:55,454][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:23:55,781][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:23:56,107][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:23:56,433][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:23:56,759][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:23:57,084][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:23:57,411][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:23:57,736][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:23:58,063][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:23:58,388][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:23:58,714][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:23:59,041][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:23:59,366][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:23:59,691][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:24:00,016][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:24:00,342][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:24:00,667][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:24:00,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:24:01,698][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:24:02,417][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:24:02,418][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:24:02,420][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:24:03,390][__main__][INFO] - Iteration 57 took 19s (27.04% Gen, 67.93% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 44m 19s. Estimated total time: 16h 4m 33s. Time estimates for 10 more iterations: 3m 12s, 100 more iterations: 32m 9s, 500 more iterations: 2h 40m 45s.
[2025-11-13 08:24:03,393][__main__][INFO] - Starting iteration 57.
[2025-11-13 08:24:03,395][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:24:03,396][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:24:08,644][__main__][INFO] - Number of regex retries in iteration 57: 0
[2025-11-13 08:24:08,645][__main__][INFO] - agents played in iteration 57 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:24:09,110][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:09,146][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:09,179][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:09,212][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:09,212][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:24:09,213][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:24:09,935][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:24:10,231][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:24:10,557][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:24:10,885][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:24:11,214][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:24:11,539][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:24:11,866][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:24:12,194][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:24:12,520][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:24:12,848][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:24:13,175][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:24:13,504][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:24:13,831][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:24:14,160][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:24:14,490][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:24:14,815][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:24:15,140][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:24:15,466][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:24:15,793][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:24:16,119][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:24:16,446][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:24:16,775][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:24:17,099][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:24:17,425][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:24:17,752][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:24:18,077][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:24:18,405][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:24:18,731][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:24:19,059][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:24:19,386][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:24:19,713][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:24:20,043][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:24:20,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:24:21,081][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:24:21,829][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:24:21,830][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:24:21,832][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:24:22,823][__main__][INFO] - Iteration 58 took 19s (27.01% Gen, 67.87% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 50m 51s. Estimated total time: 16h 11m 25s. Time estimates for 10 more iterations: 3m 14s, 100 more iterations: 32m 22s, 500 more iterations: 2h 41m 54s.
[2025-11-13 08:24:22,826][__main__][INFO] - Starting iteration 58.
[2025-11-13 08:24:22,829][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:24:22,829][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:24:28,125][__main__][INFO] - Number of regex retries in iteration 58: 0
[2025-11-13 08:24:28,126][__main__][INFO] - agents played in iteration 58 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:24:28,574][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:28,607][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:28,640][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:28,673][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:28,674][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:24:28,674][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:24:29,389][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:24:29,684][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:24:30,009][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:24:30,335][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:24:30,665][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:24:30,989][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:24:31,317][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:24:31,643][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:24:31,967][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:24:32,292][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:24:32,618][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:24:32,945][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:24:33,272][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:24:33,599][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:24:33,924][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:24:34,249][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:24:34,574][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:24:34,899][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:24:35,225][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:24:35,556][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:24:35,885][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:24:36,209][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:24:36,535][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:24:36,860][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:24:37,188][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:24:37,512][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:24:37,840][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:24:38,166][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:24:38,497][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:24:38,825][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:24:39,150][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:24:39,477][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:24:39,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:24:40,512][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:24:41,231][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:24:41,233][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:24:41,235][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:24:42,194][__main__][INFO] - Iteration 59 took 19s (27.35% Gen, 67.69% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 47m 25s. Estimated total time: 16h 8m 18s. Time estimates for 10 more iterations: 3m 13s, 100 more iterations: 32m 16s, 500 more iterations: 2h 41m 23s.
[2025-11-13 08:24:42,196][__main__][INFO] - Starting iteration 59.
[2025-11-13 08:24:42,199][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:24:42,199][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:24:47,451][__main__][INFO] - Number of regex retries in iteration 59: 0
[2025-11-13 08:24:47,452][__main__][INFO] - agents played in iteration 59 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:24:47,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:47,941][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:47,974][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:48,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:48,008][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:24:48,008][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:24:48,725][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:24:49,021][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:24:49,347][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:24:49,671][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:24:49,995][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:24:50,320][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:24:50,645][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:24:50,971][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:24:51,297][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:24:51,622][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:24:51,949][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:24:52,275][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:24:52,600][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:24:52,925][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:24:53,251][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:24:53,576][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:24:53,901][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:24:54,226][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:24:54,552][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:24:54,877][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:24:55,204][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:24:55,529][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:24:55,855][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:24:56,181][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:24:56,505][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:24:56,830][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:24:57,157][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:24:57,483][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:24:57,807][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:24:58,133][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:24:58,458][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:24:58,784][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:24:59,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:24:59,816][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:25:00,548][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:25:00,549][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:25:00,551][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:25:01,521][__main__][INFO] - Iteration 60 took 19s (27.18% Gen, 67.79% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 44m 55s. Estimated total time: 16h 6m 7s. Time estimates for 10 more iterations: 3m 13s, 100 more iterations: 32m 12s, 500 more iterations: 2h 41m 1s.
[2025-11-13 08:25:01,523][__main__][INFO] - Starting iteration 60.
[2025-11-13 08:25:01,526][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:25:01,526][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:25:06,824][__main__][INFO] - Number of regex retries in iteration 60: 0 [2025-11-13 08:25:06,825][__main__][INFO] - agents played in iteration 60 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 08:25:07,283][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:07,319][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:07,351][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:07,384][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:07,384][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:25:07,385][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:25:08,097][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:25:08,395][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:25:08,721][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:25:09,049][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:25:09,375][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:25:09,702][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:25:10,027][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:25:10,353][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:25:10,678][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:25:11,003][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:25:11,329][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:25:11,653][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:25:11,981][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:25:12,308][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:25:12,634][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:25:12,961][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:25:13,290][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:25:13,617][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:25:13,944][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:25:14,272][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:25:14,600][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 08:25:14,930][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:25:15,259][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:25:15,584][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:25:15,909][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:25:16,234][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:25:16,559][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:25:16,886][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:25:17,211][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:25:17,537][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:25:17,864][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:25:18,191][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:25:18,516][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:25:19,224][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:25:19,949][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:25:19,950][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:25:19,952][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:25:21,950][__main__][INFO] - Iteration 61 took 20s (25.94% Gen, 64.27% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 39m 43s. Estimated total time: 17h 1m 16s. Time estimates for 10 more iterations: 3m 24s, 100 more iterations: 34m 2s, 500 more iterations: 2h 50m 12s.
[2025-11-13 08:25:21,952][__main__][INFO] - Starting iteration 61.
[2025-11-13 08:25:21,955][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:25:21,955][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:25:27,708][__main__][INFO] - Number of regex retries in iteration 61: 0
[2025-11-13 08:25:27,709][__main__][INFO] - agents played in iteration 61 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:25:28,164][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:28,200][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:28,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:28,266][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:28,266][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:25:28,266][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:25:28,981][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:25:29,276][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:25:29,604][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:25:29,930][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:25:30,258][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:25:30,586][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:25:30,910][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:25:31,235][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:25:31,560][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:25:31,884][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:25:32,209][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:25:32,533][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:25:32,859][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:25:33,184][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:25:33,510][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:25:33,838][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:25:34,163][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:25:34,489][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:25:34,814][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:25:35,141][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:25:35,468][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:25:35,797][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:25:36,122][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:25:36,449][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:25:36,776][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:25:37,100][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:25:37,425][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:25:37,751][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:25:38,077][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:25:38,401][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:25:38,727][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:25:39,051][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:25:39,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
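The pattern repeated in these entries (128 mini-batches whose policy-gradient losses are accumulated, followed by a single "Apply reinforce step" update) is standard gradient accumulation. A minimal framework-free sketch of that control flow, with illustrative names and a plain-SGD update that are not taken from the mllm codebase:

```python
def accumulate_and_step(params, grads_per_minibatch, lr=1e-4, log_every=4):
    """Sketch of the log's pattern: sum gradients over all mini-batches,
    then apply exactly one parameter update ("Apply reinforce step").
    Names and the SGD rule are illustrative, not from mllm."""
    accum = [0.0] * len(params)
    n = len(grads_per_minibatch)
    for i, grads in enumerate(grads_per_minibatch):
        if i % log_every == 0:
            print(f"Processing mini-batch {i} of {n}")
        for j, g in enumerate(grads):
            accum[j] += g  # accumulate only; no update inside the loop
    # single optimizer step after the whole pass
    return [p - lr * g for p, g in zip(params, accum)]
```

Logging only every fourth mini-batch mirrors the stride visible in the entries above (0, 4, 8, ...).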
[2025-11-13 08:25:40,085][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:25:40,814][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:25:40,815][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:25:40,817][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:25:41,788][__main__][INFO] - Iteration 62 took 19s (29.01% Gen, 66.09% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 9m 48s. Estimated total time: 16h 31m 41s. Time estimates for 10 more iterations: 3m 18s, 100 more iterations: 33m 3s, 500 more iterations: 2h 45m 16s.
[2025-11-13 08:25:41,790][__main__][INFO] - Starting iteration 62.
[2025-11-13 08:25:41,794][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:25:41,794][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:25:47,395][__main__][INFO] - Number of regex retries in iteration 62: 0
[2025-11-13 08:25:47,395][__main__][INFO] - agents played in iteration 62 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:25:47,847][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:47,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:47,915][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:47,948][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:47,949][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:25:47,949][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:25:48,662][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:25:48,957][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:25:49,284][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:25:49,611][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:25:49,937][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:25:50,264][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:25:50,594][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:25:50,923][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:25:51,251][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:25:51,576][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:25:51,902][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:25:52,226][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:25:52,550][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:25:52,876][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:25:53,201][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:25:53,525][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:25:53,851][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:25:54,178][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:25:54,504][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:25:54,828][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:25:55,152][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:25:55,477][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:25:55,803][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:25:56,129][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:25:56,454][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:25:56,781][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:25:57,105][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:25:57,430][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:25:57,756][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:25:58,081][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:25:58,406][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:25:58,731][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:25:59,055][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:25:59,764][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:26:00,492][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:26:00,493][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:26:00,495][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:26:01,463][__main__][INFO] - Iteration 63 took 19s (28.47% Gen, 66.60% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 1m 18s. Estimated total time: 16h 23m 30s. Time estimates for 10 more iterations: 3m 16s, 100 more iterations: 32m 47s, 500 more iterations: 2h 43m 55s.
[2025-11-13 08:26:01,465][__main__][INFO] - Starting iteration 63.
[2025-11-13 08:26:01,468][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:26:01,468][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:26:07,067][__main__][INFO] - Number of regex retries in iteration 63: 0
[2025-11-13 08:26:07,067][__main__][INFO] - agents played in iteration 63 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:26:07,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:07,552][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:07,585][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:07,618][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:07,618][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:26:07,619][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:26:08,334][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:26:08,628][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:26:08,954][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:26:09,281][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:26:09,606][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:26:09,932][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:26:10,257][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:26:10,584][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:26:10,912][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:26:11,237][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:26:11,565][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:26:11,890][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:26:12,215][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:26:12,540][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:26:12,865][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:26:13,191][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:26:13,518][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:26:13,843][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:26:14,169][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:26:14,495][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:26:14,820][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:26:15,145][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:26:15,471][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:26:15,801][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:26:16,129][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:26:16,455][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:26:16,780][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:26:17,109][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:26:17,437][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:26:17,767][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:26:18,092][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:26:18,418][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:26:18,744][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:26:19,458][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:26:20,199][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:26:20,201][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:26:20,203][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:26:21,172][__main__][INFO] - Iteration 64 took 19s (28.41% Gen, 66.66% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 2m 43s. Estimated total time: 16h 25m 15s. Time estimates for 10 more iterations: 3m 17s, 100 more iterations: 32m 50s, 500 more iterations: 2h 44m 12s.
[2025-11-13 08:26:21,175][__main__][INFO] - Starting iteration 64.
[2025-11-13 08:26:21,178][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:26:21,178][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:26:26,839][__main__][INFO] - Number of regex retries in iteration 64: 0
[2025-11-13 08:26:26,840][__main__][INFO] - agents played in iteration 64 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:26:27,300][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:27,332][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:27,365][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:27,397][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:27,398][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:26:27,399][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:26:28,115][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:26:28,409][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:26:28,735][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:26:29,061][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:26:29,386][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:26:29,712][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:26:30,037][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:26:30,362][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:26:30,688][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:26:31,012][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:26:31,341][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:26:31,672][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:26:31,996][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:26:32,322][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:26:32,648][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:26:32,972][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:26:33,300][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:26:33,627][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:26:33,953][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:26:34,278][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:26:34,604][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:26:34,936][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:26:35,261][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:26:35,585][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:26:35,910][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:26:36,235][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:26:36,562][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:26:36,890][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:26:37,215][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:26:37,543][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:26:37,867][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:26:38,195][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:26:38,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:26:39,237][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:26:39,969][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:26:39,970][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:26:39,972][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:26:40,946][__main__][INFO] - Iteration 65 took 19s (28.64% Gen, 66.43% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 5m 35s. Estimated total time: 16h 28m 27s. Time estimates for 10 more iterations: 3m 17s, 100 more iterations: 32m 56s, 500 more iterations: 2h 44m 44s. [2025-11-13 08:26:40,948][__main__][INFO] - Starting iteration 65. [2025-11-13 08:26:40,951][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2025-11-13 08:26:40,952][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:26:46,512][__main__][INFO] - Number of regex retries in iteration 65: 0 [2025-11-13 08:26:46,513][__main__][INFO] - agents played in iteration 65 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 08:26:46,966][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:26:47,003][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:26:47,038][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:26:47,071][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:26:47,072][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:26:47,072][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:26:47,793][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:26:48,089][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:26:48,416][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:26:48,742][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:26:49,068][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:26:49,393][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:26:49,721][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:26:50,045][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:26:50,375][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:26:50,699][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:26:51,025][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:26:51,353][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:26:51,681][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:26:52,011][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:26:52,343][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:26:52,673][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:26:52,999][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:26:53,335][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:26:53,662][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:26:53,995][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:26:54,322][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:26:54,650][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:26:54,978][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:26:55,309][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:26:55,635][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:26:55,963][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:26:56,288][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:26:56,615][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:26:56,943][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:26:57,270][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:26:57,597][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:26:57,926][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:26:58,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
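The "Processing mini-batch … of 128" entries followed by a single "Accumulated the policy gradient loss" message and one "Apply reinforce step" per iteration reflect a standard gradient-accumulation pattern: per-mini-batch gradients are summed across the whole batch and a single optimizer update is applied at the end. A minimal framework-free sketch of that pattern (hypothetical names throughout; this is not the mllm implementation):

```python
# Hypothetical sketch of the accumulate-then-step pattern visible in the log.
# `grad_fn` and `apply_step` stand in for the real per-mini-batch backward pass
# and optimizer step; here gradients are plain floats for illustration.
def accumulate_and_step(minibatches, grad_fn, apply_step, log_every=4):
    total_grad = 0.0
    total_tokens = 0
    for i, mb in enumerate(minibatches):
        if i % log_every == 0:  # matches the every-4 logging cadence in the log
            print(f"Processing mini-batch {i} of {len(minibatches)}")
        g, n_tokens = grad_fn(mb)  # per-mini-batch gradient and token count
        total_grad += g
        total_tokens += n_tokens
    print(f"Accumulated the policy gradient loss for {total_tokens} tokens.")
    apply_step(total_grad / max(total_tokens, 1))  # one update per iteration
    return total_tokens
```

Normalizing by the accumulated token count (rather than the mini-batch count) is one common choice when sequence lengths vary; whether mllm does this is an assumption.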
[2025-11-13 08:26:58,981][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:26:59,730][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:26:59,731][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:26:59,733][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:27:00,708][__main__][INFO] - Iteration 66 took 19s (28.14% Gen, 66.91% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 4m 39s. Estimated total time: 16h 27m 51s. Time estimates for 10 more iterations: 3m 17s, 100 more iterations: 32m 55s, 500 more iterations: 2h 44m 38s.
[2025-11-13 08:27:00,710][__main__][INFO] - Starting iteration 66.
[2025-11-13 08:27:00,713][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
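The per-iteration summary lines ("Iteration N took 19s … Estimated remaining time … Time estimates for 10 more iterations …") are simple arithmetic on an average per-iteration wall-clock time. A sketch of how such figures could be computed and formatted in the log's "Xh Ym Zs" style (hypothetical helper names; not the actual __main__ code):

```python
# Hypothetical ETA helpers mirroring the duration format seen in the log.
def fmt(seconds: float) -> str:
    """Format a duration like '2h 44m 38s' (or '3m 17s' when hours are zero)."""
    s = int(round(seconds))
    h, s = divmod(s, 3600)
    m, s = divmod(s, 60)
    parts = []
    if h:
        parts.append(f"{h}h")
    if h or m:
        parts.append(f"{m}m")
    parts.append(f"{s}s")
    return " ".join(parts)

def eta(avg_iter_seconds: float, iters_remaining: int) -> str:
    """Project remaining wall-clock time from the average iteration time."""
    return fmt(avg_iter_seconds * iters_remaining)
```

With an average of roughly 19–20 s per iteration, this yields estimates on the order of those in the log (about 3 m for 10 iterations, half an hour for 100).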
[2025-11-13 08:27:00,714][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:27:06,314][__main__][INFO] - Number of regex retries in iteration 66: 0
[2025-11-13 08:27:06,315][__main__][INFO] - agents played in iteration 66 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:27:06,765][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:06,799][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:06,831][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:06,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:06,865][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:27:06,865][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:27:07,581][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:27:07,876][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:27:08,201][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:27:08,528][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:27:08,854][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:27:09,183][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:27:09,511][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:27:09,835][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:27:10,162][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:27:10,486][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:27:10,812][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:27:11,138][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:27:11,465][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:27:11,794][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:27:12,117][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:27:12,441][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:27:12,767][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:27:13,091][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:27:13,418][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:27:13,745][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:27:14,071][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:27:14,400][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:27:14,725][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:27:15,051][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:27:15,375][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:27:15,700][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:27:16,024][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:27:16,349][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:27:16,676][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:27:17,001][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:27:17,326][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:27:17,653][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:27:17,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:27:18,705][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:27:19,442][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:27:19,444][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:27:19,445][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:27:20,416][__main__][INFO] - Iteration 67 took 19s (28.42% Gen, 66.64% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 1m 40s. Estimated total time: 16h 25m 11s. Time estimates for 10 more iterations: 3m 17s, 100 more iterations: 32m 50s, 500 more iterations: 2h 44m 11s.
[2025-11-13 08:27:20,419][__main__][INFO] - Starting iteration 67.
[2025-11-13 08:27:20,422][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:27:20,422][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:27:26,011][__main__][INFO] - Number of regex retries in iteration 67: 0
[2025-11-13 08:27:26,012][__main__][INFO] - agents played in iteration 67 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:27:26,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:26,499][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:26,532][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:26,564][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:26,565][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:27:26,565][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:27:27,283][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:27:27,580][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:27:27,906][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:27:28,232][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:27:28,560][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:27:28,885][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:27:29,212][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:27:29,538][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:27:29,865][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:27:30,192][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:27:30,517][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:27:30,843][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:27:31,167][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:27:31,492][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:27:31,817][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:27:32,146][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:27:32,471][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:27:32,799][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:27:33,123][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:27:33,449][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:27:33,775][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:27:34,102][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:27:34,427][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:27:34,754][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:27:35,080][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:27:35,407][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:27:35,732][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:27:36,058][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:27:36,383][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:27:36,713][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:27:37,036][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:27:37,361][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:27:37,687][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:27:38,395][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:27:39,125][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:27:39,126][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:27:39,128][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:27:40,108][__main__][INFO] - Iteration 68 took 19s (28.39% Gen, 66.62% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 0m 30s. Estimated total time: 16h 24m 21s. Time estimates for 10 more iterations: 3m 16s, 100 more iterations: 32m 48s, 500 more iterations: 2h 44m 3s.
[2025-11-13 08:27:40,111][__main__][INFO] - Starting iteration 68.
[2025-11-13 08:27:40,114][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:27:40,115][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:27:45,722][__main__][INFO] - Number of regex retries in iteration 68: 0
[2025-11-13 08:27:45,723][__main__][INFO] - agents played in iteration 68 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:27:46,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:46,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:46,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:46,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:46,278][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:27:46,278][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:27:46,988][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:27:47,284][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:27:47,610][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:27:47,936][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:27:48,263][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:27:48,590][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:27:48,915][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:27:49,239][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:27:49,567][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:27:49,891][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:27:50,214][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:27:50,539][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:27:50,866][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:27:51,191][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:27:51,516][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:27:51,842][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:27:52,166][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:27:52,493][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:27:52,818][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:27:53,145][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:27:53,471][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:27:53,795][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:27:54,120][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:27:54,447][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:27:54,779][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:27:55,103][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:27:55,430][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:27:55,754][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:27:56,079][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:27:56,408][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:27:56,732][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:27:57,060][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:27:57,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:27:58,107][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:27:58,835][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:27:58,837][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:27:58,838][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:27:59,811][__main__][INFO] - Iteration 69 took 19s (28.47% Gen, 66.58% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 0m 43s. Estimated total time: 16h 24m 54s. Time estimates for 10 more iterations: 3m 16s, 100 more iterations: 32m 49s, 500 more iterations: 2h 44m 9s.
[2025-11-13 08:27:59,814][__main__][INFO] - Starting iteration 69.
[2025-11-13 08:27:59,817][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:27:59,817][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:28:05,353][__main__][INFO] - Number of regex retries in iteration 69: 0
[2025-11-13 08:28:05,354][__main__][INFO] - agents played in iteration 69 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:28:05,807][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:05,842][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:05,876][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:05,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:05,909][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:28:05,910][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:28:06,624][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:28:06,919][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:28:07,246][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:28:07,571][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:28:07,895][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:28:08,220][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:28:08,544][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:28:08,874][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:28:09,198][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:28:09,522][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:28:09,847][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:28:10,173][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:28:10,504][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:28:10,828][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:28:11,155][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:28:11,480][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:28:11,805][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:28:12,132][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:28:12,459][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:28:12,786][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:28:13,110][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:28:13,435][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:28:13,763][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:28:14,088][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:28:14,419][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:28:14,745][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:28:15,070][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:28:15,397][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:28:15,724][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:28:16,051][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:28:16,376][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:28:16,701][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:28:17,026][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:28:17,740][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:28:18,471][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:28:18,472][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:28:18,474][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:28:19,439][__main__][INFO] - Iteration 70 took 19s (28.21% Gen, 66.86% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 56m 40s. Estimated total time: 16h 21m 10s. Time estimates for 10 more iterations: 3m 16s, 100 more iterations: 32m 42s, 500 more iterations: 2h 43m 31s.
[2025-11-13 08:28:19,442][__main__][INFO] - Starting iteration 70.
[2025-11-13 08:28:19,444][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:28:19,445][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:28:25,095][__main__][INFO] - Number of regex retries in iteration 70: 0
[2025-11-13 08:28:25,096][__main__][INFO] - agents played in iteration 70 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:28:25,545][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:25,577][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:25,610][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:25,643][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:25,644][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:28:25,644][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:28:26,350][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:28:26,646][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:28:26,974][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:28:27,304][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:28:27,635][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:28:27,962][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:28:28,288][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:28:28,614][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:28:28,938][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:28:29,269][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:28:29,594][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:28:29,920][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:28:30,245][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:28:30,570][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:28:30,900][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:28:31,224][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:28:31,551][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:28:31,877][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:28:32,201][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:28:32,528][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:28:32,855][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:28:33,181][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:28:33,504][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:28:33,832][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:28:34,158][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:28:34,486][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:28:34,813][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:28:35,139][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:28:35,466][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:28:35,793][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:28:36,117][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:28:36,441][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:28:36,772][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:28:37,495][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:28:38,235][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:28:38,236][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:28:38,238][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:28:40,127][__main__][INFO] - Iteration 71 took 20s (27.32% Gen, 63.54% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 49m 20s. Estimated total time: 17h 14m 11s. Time estimates for 10 more iterations: 3m 26s, 100 more iterations: 34m 28s, 500 more iterations: 2h 52m 21s.
[2025-11-13 08:28:40,130][__main__][INFO] - Starting iteration 71.
[2025-11-13 08:28:40,133][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:28:40,133][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:28:46,188][__main__][INFO] - Number of regex retries in iteration 71: 0 [2025-11-13 08:28:46,188][__main__][INFO] - agents played in iteration 71 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 08:28:46,638][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:28:46,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:28:46,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:28:46,738][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:28:46,739][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:28:46,739][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:28:47,449][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:28:47,746][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:28:48,072][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:28:48,398][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:28:48,726][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:28:49,052][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:28:49,376][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:28:49,703][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:28:50,028][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:28:50,355][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:28:50,679][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:28:51,008][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:28:51,333][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:28:51,660][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:28:51,985][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:28:52,312][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:28:52,636][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:28:52,961][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:28:53,287][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:28:53,617][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:28:53,944][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 08:28:54,273][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:28:54,600][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:28:54,926][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:28:55,252][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:28:55,580][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:28:55,905][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:28:56,229][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:28:56,556][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:28:56,880][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:28:57,208][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:28:57,532][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:28:57,861][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:28:58,562][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:28:59,302][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:28:59,304][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:28:59,305][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:29:00,347][__main__][INFO] - Iteration 72 took 20s (29.95% Gen, 64.88% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 25m 35s. Estimated total time: 16h 50m 46s. Time estimates for 10 more iterations: 3m 22s, 100 more iterations: 33m 41s, 500 more iterations: 2h 48m 27s. [2025-11-13 08:29:00,350][__main__][INFO] - Starting iteration 72. [2025-11-13 08:29:00,352][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2025-11-13 08:29:00,353][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:29:06,236][__main__][INFO] - Number of regex retries in iteration 72: 0 [2025-11-13 08:29:06,237][__main__][INFO] - agents played in iteration 72 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 08:29:06,689][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:06,725][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:06,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:06,794][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:06,794][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:29:06,795][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:29:07,506][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:29:07,801][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:29:08,127][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:29:08,453][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:29:08,781][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:29:09,106][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:29:09,430][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:29:09,757][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:29:10,084][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:29:10,412][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:29:10,736][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:29:11,061][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:29:11,388][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:29:11,716][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:29:12,043][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:29:12,367][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:29:12,692][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:29:13,018][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:29:13,343][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:29:13,669][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:29:13,993][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 08:29:14,319][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:29:14,643][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:29:14,968][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:29:15,294][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:29:15,620][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:29:15,946][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:29:16,272][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:29:16,597][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:29:16,922][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:29:17,249][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:29:17,575][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:29:17,907][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:29:18,611][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:29:19,337][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:29:19,338][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:29:19,340][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:29:20,325][__main__][INFO] - Iteration 73 took 19s (29.46% Gen, 65.60% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 13m 9s. Estimated total time: 16h 38m 40s. Time estimates for 10 more iterations: 3m 19s, 100 more iterations: 33m 17s, 500 more iterations: 2h 46m 26s. [2025-11-13 08:29:20,327][__main__][INFO] - Starting iteration 73. [2025-11-13 08:29:20,330][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2025-11-13 08:29:20,330][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:29:26,283][__main__][INFO] - Number of regex retries in iteration 73: 0 [2025-11-13 08:29:26,283][__main__][INFO] - agents played in iteration 73 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 08:29:26,735][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:26,769][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:26,804][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:26,837][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:26,838][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:29:26,838][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:29:27,550][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:29:27,846][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:29:28,174][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:29:28,500][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:29:28,827][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:29:29,154][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:29:29,479][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:29:29,805][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:29:30,129][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:29:30,454][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:29:30,782][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:29:31,107][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:29:31,431][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:29:31,758][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:29:32,083][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:29:32,409][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:29:32,735][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:29:33,060][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:29:33,386][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:29:33,711][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:29:34,035][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 08:29:34,362][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:29:34,688][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:29:35,013][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:29:35,337][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:29:35,662][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:29:35,991][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:29:36,316][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:29:36,646][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:29:36,974][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:29:37,299][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:29:37,625][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:29:37,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:29:38,657][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:29:39,384][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:29:39,385][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:29:39,387][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:29:40,468][__main__][INFO] - Iteration 74 took 20s (29.56% Gen, 65.07% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 21m 6s. Estimated total time: 16h 46m 58s. Time estimates for 10 more iterations: 3m 21s, 100 more iterations: 33m 33s, 500 more iterations: 2h 47m 49s. [2025-11-13 08:29:40,471][__main__][INFO] - Starting iteration 74. [2025-11-13 08:29:40,473][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2025-11-13 08:29:40,474][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:29:46,301][__main__][INFO] - Number of regex retries in iteration 74: 0 [2025-11-13 08:29:46,302][__main__][INFO] - agents played in iteration 74 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 08:29:46,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:46,785][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:46,819][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:46,853][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:46,854][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:29:46,854][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:29:47,563][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:29:47,858][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:29:48,186][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:29:48,512][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:29:48,838][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:29:49,165][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:29:49,495][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:29:49,819][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:29:50,145][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:29:50,471][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:29:50,800][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:29:51,127][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:29:51,451][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:29:51,777][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:29:52,102][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:29:52,427][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:29:52,752][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:29:53,077][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:29:53,404][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:29:53,731][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:29:54,056][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 08:29:54,381][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:29:54,708][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:29:55,033][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:29:55,358][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:29:55,685][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:29:56,011][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:29:56,338][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:29:56,664][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:29:56,992][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:29:57,320][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:29:57,647][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:29:57,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:29:58,694][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:29:59,426][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:29:59,428][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:29:59,429][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:30:00,501][__main__][INFO] - Iteration 75 took 20s (29.10% Gen, 65.54% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 15m 14s. Estimated total time: 16h 41m 25s. Time estimates for 10 more iterations: 3m 20s, 100 more iterations: 33m 22s, 500 more iterations: 2h 46m 54s. [2025-11-13 08:30:00,502][__main__][INFO] - Starting iteration 75. [2025-11-13 08:30:00,505][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2025-11-13 08:30:00,505][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:30:06,387][__main__][INFO] - Number of regex retries in iteration 75: 0 [2025-11-13 08:30:06,388][__main__][INFO] - agents played in iteration 75 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 08:30:06,842][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:06,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:06,913][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:06,947][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:06,948][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:30:06,949][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:30:07,649][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:30:07,945][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:30:08,274][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:30:08,601][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:30:08,930][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:30:09,257][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:30:09,586][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:30:09,912][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:30:10,236][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:30:10,561][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:30:10,886][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:30:11,214][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:30:11,544][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:30:11,870][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:30:12,197][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:30:12,523][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:30:12,848][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:30:13,172][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:30:13,499][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:30:13,823][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:30:14,149][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 08:30:14,477][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:30:14,806][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:30:15,130][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:30:15,455][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:30:15,780][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:30:16,107][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:30:16,432][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:30:16,757][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:30:17,088][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:30:17,417][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:30:17,744][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:30:18,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:30:18,785][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:30:19,527][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:30:19,528][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:30:19,530][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:30:20,487][__main__][INFO] - Iteration 76 took 19s (29.44% Gen, 65.77% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 12m 36s. Estimated total time: 16h 39m 7s. Time estimates for 10 more iterations: 3m 19s, 100 more iterations: 33m 18s, 500 more iterations: 2h 46m 31s. [2025-11-13 08:30:20,489][__main__][INFO] - Starting iteration 76. [2025-11-13 08:30:20,491][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2025-11-13 08:30:20,492][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:30:26,383][__main__][INFO] - Number of regex retries in iteration 76: 0 [2025-11-13 08:30:26,384][__main__][INFO] - agents played in iteration 76 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 08:30:26,824][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:26,859][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:26,894][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:26,928][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:26,928][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:30:26,929][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:30:27,629][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:30:27,924][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:30:28,254][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:30:28,582][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:30:28,909][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:30:29,235][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:30:29,565][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:30:29,894][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:30:30,221][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:30:30,548][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:30:30,876][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:30:31,204][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:30:31,529][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:30:31,855][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:30:32,181][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:30:32,507][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:30:32,834][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:30:33,161][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:30:33,486][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:30:33,812][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:30:34,138][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:30:34,465][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:30:34,791][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:30:35,116][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:30:35,440][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:30:35,766][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:30:36,091][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:30:36,417][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:30:36,743][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:30:37,068][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:30:37,392][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:30:37,719][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:30:38,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:30:38,759][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:30:39,495][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:30:39,497][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:30:39,499][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:30:40,481][__main__][INFO] - Iteration 77 took 19s (29.47% Gen, 65.60% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 12m 41s. Estimated total time: 16h 39m 32s. Time estimates for 10 more iterations: 3m 19s, 100 more iterations: 33m 19s, 500 more iterations: 2h 46m 35s.
[2025-11-13 08:30:40,483][__main__][INFO] - Starting iteration 77.
[2025-11-13 08:30:40,487][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:30:40,487][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:30:46,346][__main__][INFO] - Number of regex retries in iteration 77: 0
[2025-11-13 08:30:46,347][__main__][INFO] - agents played in iteration 77 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:30:46,795][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:46,831][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:46,866][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:46,899][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:46,900][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:30:46,900][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:30:47,611][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:30:47,906][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:30:48,232][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:30:48,559][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:30:48,885][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:30:49,210][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:30:49,536][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:30:49,866][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:30:50,193][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:30:50,522][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:30:50,853][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:30:51,179][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:30:51,509][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:30:51,835][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:30:52,161][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:30:52,485][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:30:52,812][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:30:53,138][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:30:53,465][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:30:53,790][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:30:54,115][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:30:54,442][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:30:54,768][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:30:55,094][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:30:55,420][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:30:55,744][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:30:56,071][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:30:56,398][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:30:56,724][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:30:57,049][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:30:57,374][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:30:57,701][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:30:58,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:30:58,721][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:30:59,440][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:30:59,441][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:30:59,443][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:31:00,553][__main__][INFO] - Iteration 78 took 20s (29.20% Gen, 65.26% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 16m 10s. Estimated total time: 16h 43m 21s. Time estimates for 10 more iterations: 3m 20s, 100 more iterations: 33m 26s, 500 more iterations: 2h 47m 13s.
[2025-11-13 08:31:00,555][__main__][INFO] - Starting iteration 78.
[2025-11-13 08:31:00,557][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:31:00,558][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:31:06,361][__main__][INFO] - Number of regex retries in iteration 78: 0
[2025-11-13 08:31:06,362][__main__][INFO] - agents played in iteration 78 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:31:06,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:06,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:06,871][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:06,904][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:06,905][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:31:06,905][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:31:07,623][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:31:07,920][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:31:08,249][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:31:08,578][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:31:08,907][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:31:09,234][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:31:09,560][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:31:09,885][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:31:10,210][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:31:10,536][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:31:10,862][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:31:11,191][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:31:11,522][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:31:11,850][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:31:12,176][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:31:12,504][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:31:12,833][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:31:13,166][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:31:13,498][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:31:13,829][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:31:14,160][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:31:14,488][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:31:14,814][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:31:15,141][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:31:15,472][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:31:15,796][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:31:16,127][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:31:16,456][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:31:16,784][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:31:17,111][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:31:17,441][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:31:17,769][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:31:18,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:31:18,805][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:31:19,518][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:31:19,520][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:31:19,521][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:31:20,594][__main__][INFO] - Iteration 79 took 20s (28.97% Gen, 65.67% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 14m 21s. Estimated total time: 16h 41m 52s. Time estimates for 10 more iterations: 3m 20s, 100 more iterations: 33m 23s, 500 more iterations: 2h 46m 58s.
[2025-11-13 08:31:20,596][__main__][INFO] - Starting iteration 79.
[2025-11-13 08:31:20,598][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:31:20,599][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:31:26,369][__main__][INFO] - Number of regex retries in iteration 79: 0
[2025-11-13 08:31:26,370][__main__][INFO] - agents played in iteration 79 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:31:26,819][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:26,853][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:26,887][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:26,921][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:26,922][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:31:26,922][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:31:27,629][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:31:27,925][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:31:28,252][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:31:28,581][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:31:28,907][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:31:29,232][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:31:29,558][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:31:29,883][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:31:30,208][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:31:30,534][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:31:30,865][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:31:31,196][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:31:31,523][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:31:31,850][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:31:32,178][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:31:32,505][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:31:32,834][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:31:33,162][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:31:33,488][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:31:33,814][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:31:34,145][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:31:34,469][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:31:34,795][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:31:35,120][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:31:35,446][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:31:35,771][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:31:36,096][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:31:36,421][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:31:36,746][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:31:37,071][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:31:37,396][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:31:37,723][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:31:38,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:31:38,759][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:31:39,476][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:31:39,477][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:31:39,478][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:31:40,445][__main__][INFO] - Iteration 80 took 19s (29.07% Gen, 66.05% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 4m 32s. Estimated total time: 16h 32m 23s. Time estimates for 10 more iterations: 3m 18s, 100 more iterations: 33m 4s, 500 more iterations: 2h 45m 23s.
[2025-11-13 08:31:40,447][__main__][INFO] - Starting iteration 80.
[2025-11-13 08:31:40,450][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:31:40,450][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:31:46,276][__main__][INFO] - Number of regex retries in iteration 80: 0
[2025-11-13 08:31:46,277][__main__][INFO] - agents played in iteration 80 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:31:46,726][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:46,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:46,795][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:46,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:46,830][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:31:46,830][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:31:47,555][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:31:47,851][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:31:48,179][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:31:48,505][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:31:48,833][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:31:49,165][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:31:49,496][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:31:49,821][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:31:50,147][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:31:50,471][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:31:50,797][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:31:51,124][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:31:51,449][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:31:51,776][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:31:52,102][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:31:52,427][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:31:52,752][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:31:53,084][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:31:53,409][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:31:53,735][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:31:54,059][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:31:54,386][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:31:54,712][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:31:55,037][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:31:55,364][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:31:55,691][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:31:56,017][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:31:56,343][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:31:56,669][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:31:56,994][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:31:57,321][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:31:57,646][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:31:57,972][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:31:58,678][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:31:59,402][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:31:59,404][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:31:59,406][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:32:01,436][__main__][INFO] - Iteration 81 took 20s (27.76% Gen, 62.55% Train). Generation: 5s, Training: 13s. Estimated remaining time: 17h 1m 9s. Estimated total time: 17h 29m 21s. Time estimates for 10 more iterations: 3m 29s, 100 more iterations: 34m 58s, 500 more iterations: 2h 54m 53s.
[2025-11-13 08:32:01,438][__main__][INFO] - Starting iteration 81.
[2025-11-13 08:32:01,440][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:32:01,441][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:32:07,699][__main__][INFO] - Number of regex retries in iteration 81: 0
[2025-11-13 08:32:07,699][__main__][INFO] - agents played in iteration 81 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:32:08,148][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:08,184][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:08,218][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:08,252][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:08,252][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:32:08,253][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:32:08,964][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:32:09,259][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:32:09,587][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:32:09,914][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:32:10,242][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:32:10,567][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:32:10,893][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:32:11,219][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:32:11,547][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:32:11,872][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:32:12,199][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:32:12,525][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:32:12,852][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:32:13,177][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:32:13,502][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:32:13,827][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:32:14,151][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:32:14,476][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:32:14,804][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:32:15,130][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:32:15,454][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:32:15,782][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:32:16,108][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:32:16,436][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:32:16,763][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:32:17,091][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:32:17,418][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:32:17,744][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:32:18,071][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:32:18,396][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:32:18,722][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:32:19,048][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:32:19,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:32:20,083][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:32:20,817][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:32:20,819][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:32:20,821][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:32:21,805][__main__][INFO] - Iteration 82 took 20s (30.73% Gen, 64.43% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 29m 42s. Estimated total time: 16h 58m 15s. Time estimates for 10 more iterations: 3m 23s, 100 more iterations: 33m 56s, 500 more iterations: 2h 49m 42s.
[2025-11-13 08:32:21,807][__main__][INFO] - Starting iteration 82.
[2025-11-13 08:32:21,810][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:32:21,810][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:32:27,922][__main__][INFO] - Number of regex retries in iteration 82: 0
[2025-11-13 08:32:27,923][__main__][INFO] - agents played in iteration 82 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:32:28,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:28,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:28,445][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:28,479][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:28,480][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:32:28,480][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:32:29,189][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:32:29,484][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:32:29,811][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:32:30,140][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:32:30,467][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:32:30,792][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:32:31,122][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:32:31,450][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:32:31,780][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:32:32,104][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:32:32,428][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:32:32,754][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:32:33,081][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:32:33,410][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:32:33,739][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:32:34,070][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:32:34,400][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:32:34,732][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:32:35,063][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:32:35,392][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:32:35,717][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:32:36,045][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:32:36,375][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:32:36,706][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:32:37,031][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:32:37,361][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:32:37,686][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:32:38,012][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:32:38,340][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:32:38,665][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:32:38,991][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:32:39,323][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:32:39,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:32:40,360][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:32:41,083][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:32:41,084][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:32:41,086][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:32:42,029][__main__][INFO] - Iteration 83 took 20s (30.23% Gen, 65.10% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 22m 9s. Estimated total time: 16h 51m 2s. Time estimates for 10 more iterations: 3m 22s, 100 more iterations: 33m 42s, 500 more iterations: 2h 48m 30s.
[2025-11-13 08:32:42,031][__main__][INFO] - Starting iteration 83.
[2025-11-13 08:32:42,034][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:32:42,035][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:32:48,114][__main__][INFO] - Number of regex retries in iteration 83: 0
[2025-11-13 08:32:48,115][__main__][INFO] - agents played in iteration 83 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:32:48,566][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:48,602][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:48,636][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:48,669][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:48,670][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:32:48,670][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:32:49,383][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:32:49,680][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:32:50,007][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:32:50,332][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:32:50,657][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:32:50,983][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:32:51,309][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:32:51,635][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:32:51,962][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:32:52,289][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:32:52,615][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:32:52,942][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:32:53,268][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:32:53,593][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:32:53,922][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:32:54,248][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:32:54,573][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:32:54,899][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:32:55,224][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:32:55,550][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:32:55,881][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:32:56,209][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:32:56,539][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:32:56,865][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:32:57,190][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:32:57,518][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:32:57,846][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:32:58,173][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:32:58,500][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:32:58,825][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:32:59,156][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:32:59,481][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:32:59,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:33:00,512][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:33:01,241][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:33:01,243][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:33:01,244][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:33:02,277][__main__][INFO] - Iteration 84 took 20s (30.03% Gen, 64.86% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 22m 56s. Estimated total time: 16h 52m 9s. Time estimates for 10 more iterations: 3m 22s, 100 more iterations: 33m 44s, 500 more iterations: 2h 48m 41s.
[2025-11-13 08:33:02,279][__main__][INFO] - Starting iteration 84.
[2025-11-13 08:33:02,281][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:33:02,282][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:33:08,320][__main__][INFO] - Number of regex retries in iteration 84: 0
[2025-11-13 08:33:08,321][__main__][INFO] - agents played in iteration 84 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:33:08,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:08,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:08,840][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:08,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:08,874][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:33:08,874][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:33:09,589][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:33:09,886][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:33:10,214][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:33:10,541][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:33:10,868][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:33:11,193][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:33:11,518][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:33:11,843][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:33:12,168][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:33:12,493][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:33:12,821][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:33:13,148][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:33:13,473][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:33:13,801][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:33:14,131][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:33:14,458][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:33:14,786][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:33:15,113][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:33:15,438][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:33:15,766][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:33:16,092][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:33:16,417][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:33:16,743][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:33:17,070][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:33:17,397][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:33:17,723][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:33:18,051][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:33:18,378][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:33:18,703][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:33:19,033][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:33:19,362][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:33:19,693][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:33:20,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:33:20,732][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:33:21,467][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:33:21,468][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:33:21,470][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:33:22,427][__main__][INFO] - Iteration 85 took 20s (29.97% Gen, 65.27% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 17m 47s. Estimated total time: 16h 47m 20s. Time estimates for 10 more iterations: 3m 21s, 100 more iterations: 33m 34s, 500 more iterations: 2h 47m 53s.
[2025-11-13 08:33:22,430][__main__][INFO] - Starting iteration 85.
[2025-11-13 08:33:22,434][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:33:22,434][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:33:28,602][__main__][INFO] - Number of regex retries in iteration 85: 0
[2025-11-13 08:33:28,603][__main__][INFO] - agents played in iteration 85 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:33:29,051][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:29,087][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:29,121][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:29,155][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:29,155][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:33:29,156][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:33:29,881][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:33:30,177][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:33:30,506][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:33:30,830][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:33:31,154][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:33:31,479][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:33:31,805][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:33:32,131][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:33:32,458][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:33:32,785][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:33:33,110][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:33:33,437][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:33:33,765][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:33:34,093][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:33:34,421][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:33:34,748][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:33:35,074][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:33:35,399][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:33:35,725][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:33:36,051][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:33:36,376][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:33:36,702][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:33:37,030][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:33:37,356][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:33:37,684][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:33:38,012][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:33:38,338][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:33:38,666][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:33:38,994][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:33:39,322][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:33:39,647][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:33:39,972][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:33:40,300][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:33:41,035][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:33:41,752][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:33:41,753][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:33:41,755][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:33:42,707][__main__][INFO] - Iteration 86 took 20s (30.42% Gen, 64.87% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 23m 49s. Estimated total time: 16h 53m 43s. Time estimates for 10 more iterations: 3m 22s, 100 more iterations: 33m 47s, 500 more iterations: 2h 48m 57s.
[2025-11-13 08:33:42,709][__main__][INFO] - Starting iteration 86.
[2025-11-13 08:33:42,712][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:33:42,712][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:33:48,835][__main__][INFO] - Number of regex retries in iteration 86: 0
[2025-11-13 08:33:48,835][__main__][INFO] - agents played in iteration 86 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:33:49,294][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:49,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:49,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:49,398][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:49,398][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:33:49,399][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:33:50,106][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:33:50,402][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:33:50,727][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:33:51,051][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:33:51,377][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:33:51,702][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:33:52,026][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:33:52,351][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:33:52,677][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:33:53,002][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:33:53,327][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:33:53,654][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:33:53,978][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:33:54,303][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:33:54,630][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:33:54,958][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:33:55,286][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:33:55,612][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:33:55,942][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:33:56,269][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:33:56,599][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:33:56,927][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:33:57,254][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:33:57,581][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:33:57,911][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:33:58,238][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:33:58,568][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:33:58,894][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:33:59,225][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:33:59,555][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:33:59,882][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:34:00,211][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:34:00,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:34:01,259][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:34:01,991][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:34:01,992][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:34:01,994][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:34:02,939][__main__][INFO] - Iteration 87 took 20s (30.27% Gen, 65.05% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 21m 8s. Estimated total time: 16h 51m 22s. Time estimates for 10 more iterations: 3m 22s, 100 more iterations: 33m 42s, 500 more iterations: 2h 48m 33s.
[2025-11-13 08:34:02,941][__main__][INFO] - Starting iteration 87.
[2025-11-13 08:34:02,944][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:34:02,944][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:34:09,058][__main__][INFO] - Number of regex retries in iteration 87: 0
[2025-11-13 08:34:09,059][__main__][INFO] - agents played in iteration 87 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:34:09,510][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:09,546][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:09,581][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:09,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:09,615][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:34:09,616][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:34:10,331][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:34:10,626][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:34:10,954][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:34:11,284][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:34:11,613][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:34:11,944][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:34:12,272][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:34:12,599][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:34:12,926][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:34:13,251][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:34:13,578][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:34:13,904][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:34:14,234][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:34:14,560][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:34:14,886][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:34:15,212][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:34:15,538][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:34:15,867][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:34:16,192][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:34:16,517][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:34:16,845][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:34:17,171][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:34:17,496][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:34:17,821][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:34:18,148][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:34:18,472][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:34:18,798][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:34:19,124][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:34:19,451][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:34:19,776][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:34:20,102][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:34:20,427][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:34:20,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:34:21,480][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:34:22,220][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:34:22,221][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:34:22,223][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:34:23,187][__main__][INFO] - Iteration 88 took 20s (30.20% Gen, 65.03% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 21m 40s. Estimated total time: 16h 52m 14s. Time estimates for 10 more iterations: 3m 22s, 100 more iterations: 33m 44s, 500 more iterations: 2h 48m 42s. [2025-11-13 08:34:23,189][__main__][INFO] - Starting iteration 88. [2025-11-13 08:34:23,192][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2025-11-13 08:34:23,193][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:34:29,400][__main__][INFO] - Number of regex retries in iteration 88: 0
[2025-11-13 08:34:29,401][__main__][INFO] - agents played in iteration 88 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:34:29,845][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:29,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:29,915][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:29,949][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:29,950][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:34:29,950][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:34:30,644][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:34:30,940][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:34:31,267][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:34:31,593][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:34:31,918][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:34:32,242][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:34:32,571][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:34:32,896][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:34:33,222][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:34:33,549][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:34:33,874][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:34:34,198][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:34:34,525][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:34:34,850][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:34:35,175][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:34:35,502][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:34:35,827][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:34:36,153][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:34:36,481][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:34:36,806][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:34:37,131][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:34:37,456][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:34:37,783][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:34:38,109][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:34:38,434][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:34:38,760][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:34:39,085][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:34:39,411][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:34:39,736][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:34:40,061][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:34:40,386][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:34:40,711][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:34:41,037][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:34:41,751][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:34:42,475][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:34:42,477][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:34:42,479][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:34:43,414][__main__][INFO] - Iteration 89 took 20s (30.70% Gen, 64.67% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 20m 15s. Estimated total time: 16h 51m 9s. Time estimates for 10 more iterations: 3m 22s, 100 more iterations: 33m 42s, 500 more iterations: 2h 48m 31s.
[2025-11-13 08:34:43,416][__main__][INFO] - Starting iteration 89.
[2025-11-13 08:34:43,419][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:34:43,419][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:34:49,515][__main__][INFO] - Number of regex retries in iteration 89: 0
[2025-11-13 08:34:49,516][__main__][INFO] - agents played in iteration 89 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:34:49,968][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:50,002][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:50,036][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:50,070][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:50,071][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:34:50,071][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:34:50,785][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:34:51,082][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:34:51,413][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:34:51,741][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:34:52,070][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:34:52,401][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:34:52,726][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:34:53,053][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:34:53,380][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:34:53,711][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:34:54,037][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:34:54,364][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:34:54,692][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:34:55,021][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:34:55,348][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:34:55,675][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:34:56,005][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:34:56,331][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:34:56,664][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:34:56,990][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:34:57,316][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:34:57,644][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:34:57,976][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:34:58,302][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:34:58,630][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:34:58,954][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:34:59,279][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:34:59,606][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:34:59,934][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:35:00,258][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:35:00,584][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:35:00,909][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:35:01,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:35:01,934][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:35:02,655][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:35:02,657][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:35:02,659][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:35:03,805][__main__][INFO] - Iteration 90 took 20s (29.90% Gen, 64.47% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 28m 4s. Estimated total time: 16h 59m 18s. Time estimates for 10 more iterations: 3m 23s, 100 more iterations: 33m 58s, 500 more iterations: 2h 49m 53s.
[2025-11-13 08:35:03,806][__main__][INFO] - Starting iteration 90.
[2025-11-13 08:35:03,809][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:35:03,809][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:35:09,950][__main__][INFO] - Number of regex retries in iteration 90: 0
[2025-11-13 08:35:09,951][__main__][INFO] - agents played in iteration 90 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:35:10,404][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:10,438][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:10,473][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:10,506][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:10,507][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:35:10,507][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:35:11,211][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:35:11,506][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:35:11,834][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:35:12,160][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:35:12,486][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:35:12,811][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:35:13,137][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:35:13,463][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:35:13,789][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:35:14,116][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:35:14,442][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:35:14,767][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:35:15,093][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:35:15,419][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:35:15,744][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:35:16,069][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:35:16,397][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:35:16,722][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:35:17,048][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:35:17,373][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:35:17,700][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:35:18,026][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:35:18,355][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:35:18,682][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:35:19,007][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:35:19,332][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:35:19,658][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:35:19,985][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:35:20,310][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:35:20,635][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:35:20,962][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:35:21,287][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:35:21,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:35:22,330][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:35:23,058][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:35:23,060][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:35:23,062][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:35:24,945][__main__][INFO] - Iteration 91 took 21s (29.05% Gen, 62.03% Train). Generation: 6s, Training: 13s. Estimated remaining time: 17h 5m 14s. Estimated total time: 17h 36m 50s. Time estimates for 10 more iterations: 3m 31s, 100 more iterations: 35m 13s, 500 more iterations: 2h 56m 8s.
[2025-11-13 08:35:24,947][__main__][INFO] - Starting iteration 91.
[2025-11-13 08:35:24,949][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:35:24,950][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:35:31,511][__main__][INFO] - Number of regex retries in iteration 91: 0
[2025-11-13 08:35:31,512][__main__][INFO] - agents played in iteration 91 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:35:31,972][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:32,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:32,040][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:32,073][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:32,074][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:35:32,075][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:35:32,784][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:35:33,080][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:35:33,407][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:35:33,734][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:35:34,061][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:35:34,386][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:35:34,713][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:35:35,038][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:35:35,363][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:35:35,690][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:35:36,018][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:35:36,345][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:35:36,671][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:35:36,996][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:35:37,320][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:35:37,646][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:35:37,971][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:35:38,295][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:35:38,621][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:35:38,947][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:35:39,273][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:35:39,599][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:35:39,925][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:35:40,250][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:35:40,576][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:35:40,902][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:35:41,226][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:35:41,553][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:35:41,878][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:35:42,203][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:35:42,531][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:35:42,856][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:35:43,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:35:43,889][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:35:44,615][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:35:44,616][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:35:44,618][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:35:45,613][__main__][INFO] - Iteration 92 took 20s (31.75% Gen, 63.43% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 41m 18s. Estimated total time: 17h 13m 14s. Time estimates for 10 more iterations: 3m 26s, 100 more iterations: 34m 26s, 500 more iterations: 2h 52m 12s.
[2025-11-13 08:35:45,616][__main__][INFO] - Starting iteration 92.
[2025-11-13 08:35:45,619][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:35:45,619][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:35:51,980][__main__][INFO] - Number of regex retries in iteration 92: 0
[2025-11-13 08:35:51,981][__main__][INFO] - agents played in iteration 92 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:35:52,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:52,475][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:52,508][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:52,541][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:52,542][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:35:52,542][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:35:53,264][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:35:53,561][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:35:53,887][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:35:54,212][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:35:54,539][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:35:54,866][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:35:55,192][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:35:55,521][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:35:55,847][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:35:56,174][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:35:56,501][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:35:56,828][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:35:57,155][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:35:57,481][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:35:57,808][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:35:58,133][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:35:58,458][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:35:58,784][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:35:59,110][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:35:59,438][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:35:59,763][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:36:00,089][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:36:00,415][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:36:00,743][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:36:01,069][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:36:01,395][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:36:01,720][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:36:02,045][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:36:02,374][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:36:02,704][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:36:03,030][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:36:03,355][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:36:03,681][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:36:04,393][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:36:05,130][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:36:05,131][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:36:05,133][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:36:06,174][__main__][INFO] - Iteration 93 took 20s (30.95% Gen, 63.98% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 35m 31s. Estimated total time: 17h 7m 48s. Time estimates for 10 more iterations: 3m 25s, 100 more iterations: 34m 15s, 500 more iterations: 2h 51m 18s.
[2025-11-13 08:36:06,176][__main__][INFO] - Starting iteration 93.
[2025-11-13 08:36:06,179][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
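The per-iteration summary lines above ("Iteration N took ...s (X% Gen, Y% Train)") carry the run's timing data in a fixed textual format. A minimal sketch of how such lines could be parsed for offline analysis is shown below; the regex and the helper name `parse_iteration_timing` are hypothetical (not part of the mllm codebase), and the pattern assumes the exact message layout seen in this log.

```python
import re

# Hypothetical pattern matching summary lines like:
#   "Iteration 90 took 20s (29.90% Gen, 64.47% Train). Generation: 6s, Training: 13s."
ITER_RE = re.compile(r"Iteration (\d+) took (\d+)s \(([\d.]+)% Gen, ([\d.]+)% Train\)")

def parse_iteration_timing(line):
    """Return (iteration, total_seconds, gen_pct, train_pct), or None if no match."""
    m = ITER_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), float(m.group(3)), float(m.group(4))

# Example against a line copied from this log:
line = ("[2025-11-13 08:35:03,805][__main__][INFO] - Iteration 90 took 20s "
        "(29.90% Gen, 64.47% Train). Generation: 6s, Training: 13s.")
print(parse_iteration_timing(line))
```

Applied over a whole log file, such a helper would make it easy to, for example, average the Gen/Train split across iterations.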
[2025-11-13 08:36:06,180][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:36:12,588][__main__][INFO] - Number of regex retries in iteration 93: 0 [2025-11-13 08:36:12,589][__main__][INFO] - agents played in iteration 93 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 08:36:13,035][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:36:13,071][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:36:13,106][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:36:13,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:36:13,140][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:36:13,141][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:36:13,852][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:36:14,148][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:36:14,474][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:36:14,799][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:36:15,124][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:36:15,453][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:36:15,785][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:36:16,114][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:36:16,440][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:36:16,768][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:36:17,098][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:36:17,428][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:36:17,755][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:36:18,081][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:36:18,407][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:36:18,737][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:36:19,062][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:36:19,388][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:36:19,714][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:36:20,039][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:36:20,364][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:36:20,689][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:36:21,015][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:36:21,341][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:36:21,667][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:36:21,997][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:36:22,325][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:36:22,652][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:36:22,977][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:36:23,304][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:36:23,629][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:36:23,959][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:36:24,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:36:25,021][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:36:25,768][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:36:25,769][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:36:25,771][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:36:26,717][__main__][INFO] - Iteration 94 took 20s (31.20% Gen, 64.18% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 34m 19s. Estimated total time: 17h 6m 56s. Time estimates for 10 more iterations: 3m 25s, 100 more iterations: 34m 13s, 500 more iterations: 2h 51m 9s.
[2025-11-13 08:36:26,719][__main__][INFO] - Starting iteration 94.
[2025-11-13 08:36:26,722][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:36:26,722][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:36:33,157][__main__][INFO] - Number of regex retries in iteration 94: 0
[2025-11-13 08:36:33,158][__main__][INFO] - agents played in iteration 94 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:36:33,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:33,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:33,690][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:33,723][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:33,724][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:36:33,724][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:36:34,436][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:36:34,732][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:36:35,062][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:36:35,390][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:36:35,717][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:36:36,044][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:36:36,374][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:36:36,702][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:36:37,028][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:36:37,354][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:36:37,684][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:36:38,015][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:36:38,343][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:36:38,667][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:36:38,991][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:36:39,316][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:36:39,645][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:36:39,977][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:36:40,306][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:36:40,631][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:36:40,956][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:36:41,285][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:36:41,612][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:36:41,938][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:36:42,264][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:36:42,590][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:36:42,915][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:36:43,239][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:36:43,566][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:36:43,893][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:36:44,220][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:36:44,549][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:36:44,874][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:36:45,581][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:36:46,301][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:36:46,302][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:36:46,304][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:36:47,244][__main__][INFO] - Iteration 95 took 20s (31.36% Gen, 64.05% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 33m 11s. Estimated total time: 17h 6m 9s. Time estimates for 10 more iterations: 3m 25s, 100 more iterations: 34m 12s, 500 more iterations: 2h 51m 1s.
[2025-11-13 08:36:47,246][__main__][INFO] - Starting iteration 95.
[2025-11-13 08:36:47,248][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:36:47,249][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:36:53,660][__main__][INFO] - Number of regex retries in iteration 95: 0
[2025-11-13 08:36:53,661][__main__][INFO] - agents played in iteration 95 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:36:54,122][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:54,166][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:54,202][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:54,236][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:54,237][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:36:54,237][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:36:54,944][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:36:55,240][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:36:55,566][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:36:55,892][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:36:56,216][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:36:56,541][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:36:56,868][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:36:57,194][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:36:57,519][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:36:57,844][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:36:58,168][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:36:58,493][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:36:58,817][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:36:59,142][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:36:59,470][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:36:59,795][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:37:00,120][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:37:00,445][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:37:00,771][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:37:01,101][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:37:01,431][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:37:01,757][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:37:02,083][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:37:02,407][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:37:02,738][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:37:03,067][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:37:03,391][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:37:03,715][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:37:04,040][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:37:04,364][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:37:04,693][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:37:05,021][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:37:05,351][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:37:06,070][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:37:06,797][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:37:06,799][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:37:06,801][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:37:07,932][__main__][INFO] - Iteration 96 took 20s (31.00% Gen, 63.52% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 40m 55s. Estimated total time: 17h 14m 14s. Time estimates for 10 more iterations: 3m 26s, 100 more iterations: 34m 28s, 500 more iterations: 2h 52m 22s.
[2025-11-13 08:37:07,935][__main__][INFO] - Starting iteration 96.
[2025-11-13 08:37:07,937][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:37:07,937][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:37:14,242][__main__][INFO] - Number of regex retries in iteration 96: 0
[2025-11-13 08:37:14,243][__main__][INFO] - agents played in iteration 96 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:37:14,694][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:14,730][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:14,764][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:14,798][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:14,798][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:37:14,799][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:37:15,517][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:37:15,813][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:37:16,140][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:37:16,466][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:37:16,791][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:37:17,117][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:37:17,444][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:37:17,769][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:37:18,096][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:37:18,421][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:37:18,746][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:37:19,071][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:37:19,398][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:37:19,722][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:37:20,047][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:37:20,372][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:37:20,697][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:37:21,024][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:37:21,354][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:37:21,679][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:37:22,005][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:37:22,330][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:37:22,656][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:37:22,982][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:37:23,308][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:37:23,632][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:37:23,957][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:37:24,284][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:37:24,612][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:37:24,939][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:37:25,265][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:37:25,592][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:37:25,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:37:26,643][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:37:27,381][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:37:27,382][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:37:27,384][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:37:28,337][__main__][INFO] - Iteration 97 took 20s (30.91% Gen, 64.42% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 26m 24s. Estimated total time: 17h 0m 3s. Time estimates for 10 more iterations: 3m 24s, 100 more iterations: 34m 0s, 500 more iterations: 2h 50m 0s.
[2025-11-13 08:37:28,340][__main__][INFO] - Starting iteration 97.
[2025-11-13 08:37:28,342][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:37:28,342][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:37:34,670][__main__][INFO] - Number of regex retries in iteration 97: 0
[2025-11-13 08:37:34,670][__main__][INFO] - agents played in iteration 97 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:37:35,123][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:35,156][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:35,190][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:35,226][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:35,226][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:37:35,227][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:37:35,925][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:37:36,219][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:37:36,545][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:37:36,871][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:37:37,199][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:37:37,527][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:37:37,858][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:37:38,190][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:37:38,515][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:37:38,844][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:37:39,172][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:37:39,501][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:37:39,833][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:37:40,165][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:37:40,499][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:37:40,830][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:37:41,160][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:37:41,489][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:37:41,815][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:37:42,141][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:37:42,469][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:37:42,795][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:37:43,121][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:37:43,445][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:37:43,772][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:37:44,097][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:37:44,422][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:37:44,746][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:37:45,071][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:37:45,396][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:37:45,721][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:37:46,047][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:37:46,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:37:47,083][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:37:47,814][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:37:47,816][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:37:47,817][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:37:48,755][__main__][INFO] - Iteration 98 took 20s (31.00% Gen, 64.40% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 26m 43s. Estimated total time: 17h 0m 43s. Time estimates for 10 more iterations: 3m 24s, 100 more iterations: 34m 1s, 500 more iterations: 2h 50m 7s.
[2025-11-13 08:37:48,757][__main__][INFO] - Starting iteration 98.
[2025-11-13 08:37:48,760][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:37:48,760][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:37:55,159][__main__][INFO] - Number of regex retries in iteration 98: 0
[2025-11-13 08:37:55,160][__main__][INFO] - agents played in iteration 98 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:37:55,610][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:55,646][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:55,679][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:55,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:55,713][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:37:55,714][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:37:56,444][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:37:56,740][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:37:57,068][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:37:57,398][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:37:57,726][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:37:58,050][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:37:58,380][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:37:58,705][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:37:59,034][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:37:59,363][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:37:59,689][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:38:00,015][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:38:00,341][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:38:00,667][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:38:00,992][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:38:01,320][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:38:01,646][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:38:01,975][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:38:02,303][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:38:02,632][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:38:02,956][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:38:03,280][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:38:03,607][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:38:03,932][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:38:04,259][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:38:04,583][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:38:04,914][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:38:05,238][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:38:05,563][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:38:05,887][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:38:06,211][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:38:06,536][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:38:06,861][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:38:07,568][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:38:08,298][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:38:08,300][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:38:08,302][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:38:09,251][__main__][INFO] - Iteration 99 took 20s (31.23% Gen, 64.13% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 30m 15s. Estimated total time: 17h 4m 35s. Time estimates for 10 more iterations: 3m 24s, 100 more iterations: 34m 9s, 500 more iterations: 2h 50m 45s.
[2025-11-13 08:38:09,253][__main__][INFO] - Starting iteration 99.
[2025-11-13 08:38:09,256][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:38:09,258][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:38:15,646][__main__][INFO] - Number of regex retries in iteration 99: 0
[2025-11-13 08:38:15,647][__main__][INFO] - agents played in iteration 99 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:38:16,098][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:16,133][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:16,167][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:16,200][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:16,201][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:38:16,201][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:38:16,922][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:38:17,217][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:38:17,547][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:38:17,875][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:38:18,204][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:38:18,529][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:38:18,857][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:38:19,183][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:38:19,508][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:38:19,833][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:38:20,160][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:38:20,485][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:38:20,813][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:38:21,139][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:38:21,468][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:38:21,794][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:38:22,119][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:38:22,447][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:38:22,773][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:38:23,100][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:38:23,428][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:38:23,752][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:38:24,080][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:38:24,405][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:38:24,729][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:38:25,056][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:38:25,380][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:38:25,707][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:38:26,032][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:38:26,356][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:38:26,682][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:38:27,006][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:38:27,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:38:28,042][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:38:28,781][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:38:28,782][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:38:28,784][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:38:29,740][__main__][INFO] - Iteration 100 took 20s (31.19% Gen, 64.13% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 29m 33s. Estimated total time: 17h 4m 14s. Time estimates for 10 more iterations: 3m 24s, 100 more iterations: 34m 8s, 500 more iterations: 2h 50m 42s.
[2025-11-13 08:38:29,742][__main__][INFO] - Starting iteration 100.
[2025-11-13 08:38:29,744][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:38:29,745][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:38:36,068][__main__][INFO] - Number of regex retries in iteration 100: 0
[2025-11-13 08:38:36,069][__main__][INFO] - agents played in iteration 100 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:38:36,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:36,558][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:36,592][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:36,625][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:36,626][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:38:36,626][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:38:37,353][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:38:37,649][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:38:37,975][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:38:38,300][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:38:38,625][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:38:38,951][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:38:39,276][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:38:39,600][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:38:39,926][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:38:40,250][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:38:40,576][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:38:40,903][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:38:41,230][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:38:41,558][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:38:41,883][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:38:42,212][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:38:42,538][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:38:42,865][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:38:43,192][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:38:43,517][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:38:43,842][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:38:44,167][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:38:44,493][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:38:44,818][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:38:45,145][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:38:45,471][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:38:45,796][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:38:46,120][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:38:46,450][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:38:46,775][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:38:47,100][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:38:47,424][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:38:47,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:38:48,477][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:38:49,217][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:38:49,219][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:38:49,221][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:38:51,208][__main__][INFO] - Iteration 101 took 21s (29.46% Gen, 61.27% Train). Generation: 6s, Training: 13s. Estimated remaining time: 17h 18m 10s. Estimated total time: 17h 53m 12s. Time estimates for 10 more iterations: 3m 34s, 100 more iterations: 35m 46s, 500 more iterations: 2h 58m 52s.
[2025-11-13 08:38:51,210][__main__][INFO] - Starting iteration 101.
[2025-11-13 08:38:51,213][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:38:51,213][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:38:58,005][__main__][INFO] - Number of regex retries in iteration 101: 0
[2025-11-13 08:38:58,006][__main__][INFO] - agents played in iteration 101 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:38:58,453][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:58,487][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:58,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:58,553][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:58,553][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:38:58,554][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:38:59,278][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:38:59,575][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:38:59,902][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:39:00,227][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:39:00,552][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:39:00,877][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:39:01,205][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:39:01,531][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:39:01,856][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:39:02,182][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:39:02,507][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:39:02,834][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:39:03,164][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:39:03,494][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:39:03,819][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:39:04,148][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:39:04,474][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:39:04,800][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:39:05,130][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:39:05,462][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:39:05,787][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:39:06,113][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:39:06,439][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:39:06,766][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:39:07,092][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:39:07,418][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:39:07,746][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:39:08,073][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:39:08,400][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:39:08,726][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:39:09,052][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:39:09,379][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:39:09,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:39:10,429][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:39:11,166][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:39:11,168][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:39:11,170][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:39:12,161][__main__][INFO] - Iteration 102 took 20s (32.42% Gen, 62.84% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 52m 6s. Estimated total time: 17h 27m 29s. Time estimates for 10 more iterations: 3m 29s, 100 more iterations: 34m 54s, 500 more iterations: 2h 54m 34s.
[2025-11-13 08:39:12,163][__main__][INFO] - Starting iteration 102.
[2025-11-13 08:39:12,166][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:39:12,167][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:39:18,884][__main__][INFO] - Number of regex retries in iteration 102: 0
[2025-11-13 08:39:18,885][__main__][INFO] - agents played in iteration 102 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:39:19,334][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:39:19,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:39:19,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:39:19,436][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:39:19,437][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:39:19,437][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:39:20,152][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:39:20,448][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:39:20,777][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:39:21,104][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:39:21,428][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:39:21,753][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:39:22,082][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:39:22,409][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:39:22,734][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:39:23,061][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:39:23,390][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:39:23,719][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:39:24,046][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:39:24,370][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:39:24,698][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:39:25,026][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:39:25,355][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:39:25,683][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:39:26,010][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:39:26,336][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:39:26,664][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:39:26,989][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:39:27,317][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:39:27,640][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:39:27,965][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:39:28,290][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:39:28,615][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:39:28,940][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:39:29,265][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:39:29,591][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:39:29,915][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:39:30,240][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:39:30,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:39:31,272][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:39:32,006][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:39:32,007][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:39:32,009][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:39:32,973][__main__][INFO] - Iteration 103 took 20s (32.29% Gen, 63.07% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 44m 39s. Estimated total time: 17h 20m 23s. Time estimates for 10 more iterations: 3m 28s, 100 more iterations: 34m 40s, 500 more iterations: 2h 53m 23s.
[2025-11-13 08:39:32,976][__main__][INFO] - Starting iteration 103.
[2025-11-13 08:39:32,979][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:39:32,979][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:39:39,755][__main__][INFO] - Number of regex retries in iteration 103: 0
[2025-11-13 08:39:39,756][__main__][INFO] - agents played in iteration 103 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:39:40,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:39:40,246][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:39:40,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:39:40,314][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:39:40,315][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:39:40,315][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:39:41,030][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:39:41,325][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:39:41,651][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:39:41,976][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:39:42,301][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:39:42,625][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:39:42,951][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:39:43,279][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:39:43,605][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:39:43,930][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:39:44,255][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:39:44,582][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:39:44,907][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:39:45,236][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:39:45,566][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:39:45,892][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:39:46,218][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:39:46,545][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:39:46,869][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:39:47,194][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:39:47,519][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:39:47,845][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:39:48,174][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:39:48,503][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:39:48,832][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:39:49,160][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:39:49,487][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:39:49,813][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:39:50,140][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:39:50,466][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:39:50,793][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:39:51,120][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:39:51,446][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:39:52,168][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:39:52,914][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:39:52,915][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:39:52,917][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:39:53,858][__main__][INFO] - Iteration 104 took 20s (32.45% Gen, 63.03% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 47m 55s. Estimated total time: 17h 24m 0s. Time estimates for 10 more iterations: 3m 28s, 100 more iterations: 34m 48s, 500 more iterations: 2h 54m 0s.
[2025-11-13 08:39:53,860][__main__][INFO] - Starting iteration 104.
[2025-11-13 08:39:53,862][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:39:53,863][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:40:00,560][__main__][INFO] - Number of regex retries in iteration 104: 0 [2025-11-13 08:40:00,561][__main__][INFO] - agents played in iteration 104 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 08:40:01,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:01,046][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:01,079][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:01,113][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:01,113][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:40:01,114][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:40:01,831][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:40:02,127][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:40:02,457][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:40:02,785][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:40:03,116][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:40:03,448][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:40:03,775][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:40:04,104][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:40:04,431][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:40:04,757][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:40:05,084][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:40:05,411][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:40:05,739][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:40:06,069][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:40:06,395][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:40:06,722][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:40:07,050][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:40:07,375][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:40:07,699][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:40:08,024][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:40:08,348][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 08:40:08,673][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:40:08,997][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:40:09,322][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:40:09,647][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:40:09,971][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:40:10,295][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:40:10,623][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:40:10,949][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:40:11,274][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:40:11,602][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:40:11,933][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:40:12,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:40:12,965][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:40:13,692][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:40:13,693][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:40:13,695][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:40:14,665][__main__][INFO] - Iteration 105 took 20s (32.20% Gen, 63.13% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 43m 44s. Estimated total time: 17h 20m 10s. Time estimates for 10 more iterations: 3m 28s, 100 more iterations: 34m 40s, 500 more iterations: 2h 53m 21s. [2025-11-13 08:40:14,667][__main__][INFO] - Starting iteration 105. [2025-11-13 08:40:14,671][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2025-11-13 08:40:14,671][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:40:21,297][__main__][INFO] - Number of regex retries in iteration 105: 0 [2025-11-13 08:40:21,297][__main__][INFO] - agents played in iteration 105 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 08:40:21,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:21,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:21,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:21,854][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:21,855][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:40:21,855][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:40:22,576][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:40:22,871][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:40:23,198][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:40:23,524][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:40:23,850][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:40:24,180][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:40:24,504][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:40:24,830][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:40:25,155][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:40:25,482][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:40:25,806][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:40:26,130][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:40:26,455][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:40:26,779][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:40:27,104][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:40:27,429][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:40:27,753][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:40:28,077][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:40:28,404][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:40:28,731][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:40:29,056][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 08:40:29,381][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:40:29,705][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:40:30,029][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:40:30,353][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:40:30,677][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:40:31,002][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:40:31,333][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:40:31,660][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:40:31,983][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:40:32,309][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:40:32,634][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:40:32,961][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:40:33,680][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:40:34,407][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:40:34,408][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:40:34,410][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:40:35,372][__main__][INFO] - Iteration 106 took 20s (32.01% Gen, 63.34% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 38m 19s. Estimated total time: 17h 15m 5s. Time estimates for 10 more iterations: 3m 27s, 100 more iterations: 34m 30s, 500 more iterations: 2h 52m 30s. [2025-11-13 08:40:35,374][__main__][INFO] - Starting iteration 106. [2025-11-13 08:40:35,377][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2025-11-13 08:40:35,377][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:40:42,048][__main__][INFO] - Number of regex retries in iteration 106: 0 [2025-11-13 08:40:42,049][__main__][INFO] - agents played in iteration 106 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 08:40:42,501][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:42,538][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:42,571][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:42,605][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:42,605][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:40:42,606][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:40:43,321][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:40:43,617][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:40:43,944][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:40:44,270][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:40:44,596][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:40:44,920][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:40:45,246][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:40:45,574][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:40:45,900][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:40:46,226][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:40:46,553][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:40:46,878][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:40:47,204][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:40:47,532][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:40:47,857][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:40:48,182][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:40:48,506][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:40:48,832][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:40:49,156][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:40:49,482][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:40:49,806][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 08:40:50,131][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:40:50,455][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:40:50,780][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:40:51,105][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:40:51,434][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:40:51,761][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:40:52,085][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:40:52,410][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:40:52,734][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:40:53,060][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:40:53,384][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:40:53,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:40:54,416][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:40:55,132][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:40:55,133][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:40:55,135][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:40:56,092][__main__][INFO] - Iteration 107 took 20s (32.20% Gen, 63.17% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 38m 41s. Estimated total time: 17h 15m 48s. Time estimates for 10 more iterations: 3m 27s, 100 more iterations: 34m 31s, 500 more iterations: 2h 52m 38s. [2025-11-13 08:40:56,094][__main__][INFO] - Starting iteration 107. [2025-11-13 08:40:56,097][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2025-11-13 08:40:56,098][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:41:02,805][__main__][INFO] - Number of regex retries in iteration 107: 0 [2025-11-13 08:41:02,805][__main__][INFO] - agents played in iteration 107 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 08:41:03,258][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:03,290][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:03,323][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:03,356][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:03,357][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:41:03,357][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:41:04,080][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:41:04,377][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:41:04,706][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:41:05,032][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:41:05,357][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:41:05,684][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:41:06,009][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:41:06,335][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:41:06,661][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:41:06,987][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:41:07,313][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:41:07,638][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:41:07,966][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:41:08,290][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:41:08,616][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:41:08,941][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:41:09,268][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:41:09,592][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:41:09,916][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:41:10,239][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:41:10,565][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 08:41:10,890][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:41:11,215][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:41:11,542][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:41:11,867][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:41:12,191][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:41:12,518][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:41:12,846][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:41:13,171][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:41:13,496][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:41:13,820][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:41:14,145][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:41:14,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:41:15,194][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:41:15,929][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:41:15,930][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:41:15,932][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:41:16,905][__main__][INFO] - Iteration 108 took 20s (32.23% Gen, 63.08% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 42m 58s. Estimated total time: 17h 20m 26s. Time estimates for 10 more iterations: 3m 28s, 100 more iterations: 34m 40s, 500 more iterations: 2h 53m 24s. [2025-11-13 08:41:16,907][__main__][INFO] - Starting iteration 108. [2025-11-13 08:41:16,911][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2025-11-13 08:41:16,911][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:41:23,639][__main__][INFO] - Number of regex retries in iteration 108: 0 [2025-11-13 08:41:23,640][__main__][INFO] - agents played in iteration 108 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 08:41:24,105][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:24,146][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:24,180][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:24,214][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:24,215][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:41:24,215][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:41:24,928][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:41:25,223][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:41:25,552][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:41:25,880][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:41:26,204][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:41:26,532][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:41:26,861][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:41:27,192][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:41:27,520][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:41:27,846][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:41:28,175][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:41:28,502][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:41:28,828][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:41:29,152][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:41:29,477][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:41:29,803][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:41:30,128][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:41:30,453][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:41:30,782][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:41:31,106][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:41:31,433][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 08:41:31,763][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:41:32,088][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:41:32,416][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:41:32,740][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:41:33,066][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:41:33,389][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:41:33,721][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:41:34,048][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:41:34,373][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:41:34,699][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:41:35,025][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:41:35,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:41:36,167][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:41:36,885][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:41:36,886][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:41:36,888][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:41:37,839][__main__][INFO] - Iteration 109 took 20s (32.15% Gen, 63.30% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 48m 37s. Estimated total time: 17h 26m 26s. Time estimates for 10 more iterations: 3m 29s, 100 more iterations: 34m 52s, 500 more iterations: 2h 54m 24s. [2025-11-13 08:41:37,841][__main__][INFO] - Starting iteration 109. [2025-11-13 08:41:37,844][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2025-11-13 08:41:37,844][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:41:41,182][mllm.models.large_language_model_local][WARNING] - Response did not match regex: (|), retry 1/1 [2025-11-13 08:41:44,743][__main__][INFO] - Number of regex retries in iteration 109: 1 [2025-11-13 08:41:44,744][__main__][INFO] - agents played in iteration 109 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 08:41:45,190][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:45,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:45,258][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:45,292][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:45,292][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:41:45,293][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:41:45,998][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:41:46,293][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:41:46,619][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:41:46,945][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:41:47,272][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:41:47,597][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:41:47,923][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:41:48,248][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:41:48,576][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:41:48,901][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:41:49,229][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:41:49,555][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:41:49,881][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:41:50,207][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:41:50,533][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:41:50,858][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:41:51,185][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:41:51,510][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:41:51,836][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:41:52,161][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:41:52,488][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:41:52,812][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:41:53,138][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:41:53,465][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:41:53,791][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:41:54,118][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:41:54,443][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:41:54,770][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:41:55,095][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:41:55,422][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:41:55,747][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:41:56,074][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:41:56,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:41:57,111][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:41:57,829][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:41:57,830][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:41:57,832][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:41:58,850][__main__][INFO] - Iteration 110 took 21s (32.84% Gen, 62.30% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 52m 11s. Estimated total time: 17h 30m 21s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 0s, 500 more iterations: 2h 55m 3s.
[2025-11-13 08:41:58,852][__main__][INFO] - Starting iteration 110.
[2025-11-13 08:41:58,855][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:41:58,855][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:42:05,396][__main__][INFO] - Number of regex retries in iteration 110: 0
[2025-11-13 08:42:05,396][__main__][INFO] - agents played in iteration 110 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:42:05,856][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:05,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:05,924][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:05,957][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:05,958][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:42:05,958][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:42:06,682][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:42:06,977][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:42:07,303][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:42:07,629][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:42:07,955][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:42:08,280][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:42:08,605][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:42:08,930][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:42:09,256][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:42:09,584][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:42:09,911][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:42:10,241][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:42:10,570][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:42:10,896][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:42:11,227][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:42:11,555][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:42:11,883][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:42:12,207][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:42:12,533][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:42:12,858][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:42:13,189][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:42:13,516][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:42:13,842][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:42:14,168][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:42:14,494][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:42:14,819][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:42:15,145][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:42:15,472][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:42:15,796][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:42:16,121][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:42:16,448][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:42:16,776][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:42:17,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:42:17,817][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:42:18,540][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:42:18,542][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:42:18,543][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:42:20,425][__main__][INFO] - Iteration 111 took 21s (30.32% Gen, 60.95% Train). Generation: 6s, Training: 13s. Estimated remaining time: 17h 20m 1s. Estimated total time: 17h 58m 32s. Time estimates for 10 more iterations: 3m 35s, 100 more iterations: 35m 57s, 500 more iterations: 2h 59m 45s.
[2025-11-13 08:42:20,427][__main__][INFO] - Starting iteration 111.
[2025-11-13 08:42:20,430][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:42:20,430][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:42:27,620][__main__][INFO] - Number of regex retries in iteration 111: 0
[2025-11-13 08:42:27,621][__main__][INFO] - agents played in iteration 111 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:42:28,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:28,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:28,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:28,173][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:28,174][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:42:28,174][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:42:28,888][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:42:29,183][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:42:29,511][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:42:29,836][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:42:30,163][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:42:30,487][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:42:30,812][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:42:31,139][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:42:31,466][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:42:31,792][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:42:32,119][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:42:32,445][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:42:32,773][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:42:33,097][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:42:33,422][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:42:33,747][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:42:34,074][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:42:34,397][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:42:34,725][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:42:35,052][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:42:35,377][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:42:35,703][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:42:36,031][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:42:36,357][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:42:36,684][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:42:37,010][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:42:37,336][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:42:37,663][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:42:37,989][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:42:38,316][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:42:38,643][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:42:38,971][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:42:39,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:42:40,013][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:42:40,738][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:42:40,740][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:42:40,741][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:42:41,720][__main__][INFO] - Iteration 112 took 21s (33.77% Gen, 61.62% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 5m 39s. Estimated total time: 17h 44m 31s. Time estimates for 10 more iterations: 3m 32s, 100 more iterations: 35m 29s, 500 more iterations: 2h 57m 25s.
[2025-11-13 08:42:41,722][__main__][INFO] - Starting iteration 112.
[2025-11-13 08:42:41,725][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:42:41,726][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:42:48,698][__main__][INFO] - Number of regex retries in iteration 112: 0
[2025-11-13 08:42:48,699][__main__][INFO] - agents played in iteration 112 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:42:49,152][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:49,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:49,221][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:49,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:49,256][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:42:49,256][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:42:49,973][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:42:50,269][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:42:50,595][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:42:50,919][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:42:51,246][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:42:51,571][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:42:51,896][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:42:52,223][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:42:52,547][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:42:52,872][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:42:53,196][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:42:53,522][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:42:53,848][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:42:54,175][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:42:54,507][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:42:54,838][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:42:55,162][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:42:55,492][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:42:55,817][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:42:56,146][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:42:56,471][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:42:56,797][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:42:57,122][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:42:57,447][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:42:57,776][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:42:58,101][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:42:58,426][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:42:58,753][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:42:59,080][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:42:59,405][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:42:59,730][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:43:00,056][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:43:00,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:43:01,099][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:43:01,828][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:43:01,830][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:43:01,832][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:43:02,797][__main__][INFO] - Iteration 113 took 21s (33.09% Gen, 62.32% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 54m 25s. Estimated total time: 17h 33m 38s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 7s, 500 more iterations: 2h 55m 36s.
[2025-11-13 08:43:02,800][__main__][INFO] - Starting iteration 113.
[2025-11-13 08:43:02,803][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:43:02,804][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:43:09,700][__main__][INFO] - Number of regex retries in iteration 113: 0
[2025-11-13 08:43:09,701][__main__][INFO] - agents played in iteration 113 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:43:10,159][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:10,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:10,225][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:10,258][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:10,259][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:43:10,259][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:43:10,999][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:43:11,295][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:43:11,625][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:43:11,950][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:43:12,280][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:43:12,609][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:43:12,935][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:43:13,262][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:43:13,589][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:43:13,915][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:43:14,240][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:43:14,567][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:43:14,893][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:43:15,220][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:43:15,547][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:43:15,873][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:43:16,201][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:43:16,525][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:43:16,852][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:43:17,176][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:43:17,501][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:43:17,828][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:43:18,154][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:43:18,479][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:43:18,804][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:43:19,130][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:43:19,455][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:43:19,780][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:43:20,106][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:43:20,431][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:43:20,757][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:43:21,083][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:43:21,408][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:43:22,143][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:43:22,898][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:43:22,899][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:43:22,901][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:43:23,887][__main__][INFO] - Iteration 114 took 21s (32.71% Gen, 62.61% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 54m 39s. Estimated total time: 17h 34m 13s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 8s, 500 more iterations: 2h 55m 42s.
[2025-11-13 08:43:23,889][__main__][INFO] - Starting iteration 114.
[2025-11-13 08:43:23,891][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:43:23,892][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:43:30,837][__main__][INFO] - Number of regex retries in iteration 114: 0
[2025-11-13 08:43:30,838][__main__][INFO] - agents played in iteration 114 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:43:31,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:31,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:31,358][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:31,392][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:31,393][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:43:31,393][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:43:32,117][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:43:32,412][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:43:32,739][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:43:33,071][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:43:33,399][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:43:33,728][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:43:34,057][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:43:34,389][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:43:34,718][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:43:35,046][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:43:35,373][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:43:35,704][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:43:36,032][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:43:36,359][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:43:36,686][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:43:37,014][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:43:37,342][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:43:37,673][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:43:38,003][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:43:38,330][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:43:38,657][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:43:38,981][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:43:39,311][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:43:39,644][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:43:39,973][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:43:40,300][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:43:40,629][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:43:40,957][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:43:41,281][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:43:41,606][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:43:41,931][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:43:42,256][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:43:42,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:43:43,296][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:43:44,027][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:43:44,028][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:43:44,030][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:43:44,986][__main__][INFO] - Iteration 115 took 21s (32.92% Gen, 62.54% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 54m 49s. Estimated total time: 17h 34m 45s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 9s, 500 more iterations: 2h 55m 47s.
[2025-11-13 08:43:44,988][__main__][INFO] - Starting iteration 115.
[2025-11-13 08:43:44,990][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:43:44,991][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:43:51,958][__main__][INFO] - Number of regex retries in iteration 115: 0
[2025-11-13 08:43:51,959][__main__][INFO] - agents played in iteration 115 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:43:52,419][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:52,453][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:52,487][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:52,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:52,522][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:43:52,522][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:43:53,251][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:43:53,546][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:43:53,873][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:43:54,199][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:43:54,525][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:43:54,856][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:43:55,184][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:43:55,510][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:43:55,836][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:43:56,162][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:43:56,487][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:43:56,812][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:43:57,136][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:43:57,462][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:43:57,787][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:43:58,112][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:43:58,437][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:43:58,762][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:43:59,087][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:43:59,411][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:43:59,735][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:44:00,060][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:44:00,388][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:44:00,714][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:44:01,038][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:44:01,367][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:44:01,692][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:44:02,017][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:44:02,342][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:44:02,672][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:44:03,002][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:44:03,327][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:44:03,653][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:44:04,369][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:44:05,112][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:44:05,113][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:44:05,115][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:44:06,081][__main__][INFO] - Iteration 116 took 21s (33.04% Gen, 62.37% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 54m 19s. Estimated total time: 17h 34m 36s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 9s, 500 more iterations: 2h 55m 46s.
[2025-11-13 08:44:06,084][__main__][INFO] - Starting iteration 116.
[2025-11-13 08:44:06,086][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:44:06,087][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:44:13,032][__main__][INFO] - Number of regex retries in iteration 116: 0
[2025-11-13 08:44:13,033][__main__][INFO] - agents played in iteration 116 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:44:13,487][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:13,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:13,555][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:13,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:13,589][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:44:13,589][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:44:14,289][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:44:14,585][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:44:14,910][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:44:15,237][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:44:15,564][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:44:15,892][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:44:16,217][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:44:16,541][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:44:16,865][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:44:17,191][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:44:17,517][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:44:17,846][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:44:18,171][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:44:18,499][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:44:18,825][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:44:19,152][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:44:19,479][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:44:19,806][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:44:20,132][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:44:20,457][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:44:20,787][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:44:21,115][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:44:21,441][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:44:21,767][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:44:22,091][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:44:22,418][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:44:22,742][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:44:23,068][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:44:23,398][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:44:23,726][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:44:24,057][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:44:24,388][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:44:24,714][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:44:25,427][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:44:26,150][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:44:26,152][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:44:26,154][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:44:27,095][__main__][INFO] - Iteration 117 took 21s (33.06% Gen, 62.45% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 49m 50s. Estimated total time: 17h 30m 28s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 0s, 500 more iterations: 2h 55m 4s.
[2025-11-13 08:44:27,097][__main__][INFO] - Starting iteration 117.
[2025-11-13 08:44:27,099][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:44:27,100][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:44:34,008][__main__][INFO] - Number of regex retries in iteration 117: 0
[2025-11-13 08:44:34,009][__main__][INFO] - agents played in iteration 117 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:44:34,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:34,498][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:34,531][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:34,565][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:34,566][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:44:34,566][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:44:35,277][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:44:35,573][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:44:35,899][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:44:36,226][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:44:36,554][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:44:36,883][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:44:37,212][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:44:37,539][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:44:37,866][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:44:38,194][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:44:38,521][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:44:38,849][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:44:39,173][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:44:39,500][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:44:39,826][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:44:40,153][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:44:40,480][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:44:40,805][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:44:41,132][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:44:41,457][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:44:41,784][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:44:42,112][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:44:42,435][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:44:42,766][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:44:43,090][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:44:43,420][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:44:43,746][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:44:44,072][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:44:44,398][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:44:44,723][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:44:45,049][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:44:45,377][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:44:45,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:44:46,424][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:44:47,146][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:44:47,147][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:44:47,149][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:44:48,088][__main__][INFO] - Iteration 118 took 20s (32.91% Gen, 62.61% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 48m 28s. Estimated total time: 17h 29m 27s. Time estimates for 10 more iterations: 3m 29s, 100 more iterations: 34m 58s, 500 more iterations: 2h 54m 54s.
[2025-11-13 08:44:48,090][__main__][INFO] - Starting iteration 118.
[2025-11-13 08:44:48,092][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:44:48,093][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:44:54,963][__main__][INFO] - Number of regex retries in iteration 118: 0
[2025-11-13 08:44:54,964][__main__][INFO] - agents played in iteration 118 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:44:55,415][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:55,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:55,483][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:55,517][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:55,518][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:44:55,518][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:44:56,237][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:44:56,534][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:44:56,865][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:44:57,192][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:44:57,520][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:44:57,845][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:44:58,171][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:44:58,496][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:44:58,822][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:44:59,147][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:44:59,474][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:44:59,800][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:45:00,127][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:45:00,455][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:45:00,783][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:45:01,113][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:45:01,441][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:45:01,769][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:45:02,098][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:45:02,424][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:45:02,750][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:45:03,077][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:45:03,402][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:45:03,729][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:45:04,055][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:45:04,382][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:45:04,709][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:45:05,035][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:45:05,361][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:45:05,688][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:45:06,012][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:45:06,340][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:45:06,665][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:45:07,381][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:45:08,108][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:45:08,109][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:45:08,111][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:45:09,050][__main__][INFO] - Iteration 119 took 20s (32.78% Gen, 62.73% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 46m 35s. Estimated total time: 17h 27m 55s. Time estimates for 10 more iterations: 3m 29s, 100 more iterations: 34m 55s, 500 more iterations: 2h 54m 39s.
[2025-11-13 08:45:09,052][__main__][INFO] - Starting iteration 119.
[2025-11-13 08:45:09,055][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:45:09,055][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:45:15,896][__main__][INFO] - Number of regex retries in iteration 119: 0
[2025-11-13 08:45:15,897][__main__][INFO] - agents played in iteration 119 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:45:16,356][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:16,389][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:16,422][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:16,456][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:16,457][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:45:16,457][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:45:17,182][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:45:17,479][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:45:17,808][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:45:18,136][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:45:18,465][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:45:18,789][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:45:19,114][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:45:19,439][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:45:19,767][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:45:20,094][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:45:20,419][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:45:20,745][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:45:21,070][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:45:21,395][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:45:21,721][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:45:22,049][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:45:22,377][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:45:22,705][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:45:23,031][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:45:23,357][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:45:23,682][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:45:24,008][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:45:24,334][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:45:24,660][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:45:24,986][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:45:25,312][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:45:25,637][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:45:25,964][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:45:26,291][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:45:26,617][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:45:26,942][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:45:27,268][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:45:27,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:45:28,320][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:45:29,061][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:45:29,062][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:45:29,064][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:45:30,030][__main__][INFO] - Iteration 120 took 20s (32.61% Gen, 62.77% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 47m 9s. Estimated total time: 17h 28m 49s. Time estimates for 10 more iterations: 3m 29s, 100 more iterations: 34m 57s, 500 more iterations: 2h 54m 48s.
[2025-11-13 08:45:30,032][__main__][INFO] - Starting iteration 120.
[2025-11-13 08:45:30,035][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:45:30,036][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:45:36,995][__main__][INFO] - Number of regex retries in iteration 120: 0
[2025-11-13 08:45:36,996][__main__][INFO] - agents played in iteration 120 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:45:37,452][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:37,484][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:37,517][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:37,550][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:37,550][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:45:37,551][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:45:38,269][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:45:38,566][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:45:38,893][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:45:39,218][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:45:39,543][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:45:39,869][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:45:40,196][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:45:40,523][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:45:40,848][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:45:41,175][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:45:41,502][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:45:41,829][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:45:42,155][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:45:42,481][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:45:42,810][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:45:43,135][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:45:43,461][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:45:43,787][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:45:44,112][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:45:44,438][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:45:44,769][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:45:45,096][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:45:45,422][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:45:45,747][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:45:46,074][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:45:46,402][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:45:46,729][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:45:47,057][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:45:47,382][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:45:47,711][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:45:48,044][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:45:48,370][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:45:48,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:45:49,418][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:45:50,164][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:45:50,166][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:45:50,168][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:45:52,077][__main__][INFO] - Iteration 121 took 22s (31.57% Gen, 59.76% Train). Generation: 6s, Training: 13s. Estimated remaining time: 17h 40m 4s. Estimated total time: 18h 22m 7s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 44s, 500 more iterations: 3h 3m 41s.
[2025-11-13 08:45:52,079][__main__][INFO] - Starting iteration 121.
[2025-11-13 08:45:52,083][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:45:52,083][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:45:59,640][__main__][INFO] - Number of regex retries in iteration 121: 0
[2025-11-13 08:45:59,641][__main__][INFO] - agents played in iteration 121 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:46:00,093][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:00,126][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:00,160][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:00,194][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:00,195][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:46:00,195][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:46:00,899][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:46:01,195][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:46:01,523][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:46:01,850][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:46:02,177][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:46:02,503][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:46:02,833][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:46:03,161][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:46:03,488][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:46:03,815][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:46:04,139][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:46:04,466][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:46:04,792][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:46:05,121][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:46:05,448][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:46:05,774][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:46:06,103][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:46:06,428][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:46:06,754][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:46:07,082][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:46:07,408][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:46:07,735][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:46:08,062][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:46:08,388][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:46:08,713][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:46:09,039][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:46:09,366][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:46:09,694][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:46:10,019][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:46:10,346][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:46:10,674][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:46:11,000][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:46:11,327][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:46:12,032][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:46:12,767][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:46:12,768][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:46:12,770][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:46:13,728][__main__][INFO] - Iteration 122 took 21s (34.91% Gen, 60.66% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 19m 52s. Estimated total time: 18h 2m 16s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 4s, 500 more iterations: 3h 0m 22s.
[2025-11-13 08:46:13,729][__main__][INFO] - Starting iteration 122.
[2025-11-13 08:46:13,732][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:46:13,733][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:46:21,005][__main__][INFO] - Number of regex retries in iteration 122: 0
[2025-11-13 08:46:21,006][__main__][INFO] - agents played in iteration 122 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:46:21,459][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:21,494][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:21,527][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:21,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:21,561][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:46:21,561][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:46:22,276][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:46:22,572][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:46:22,900][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:46:23,225][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:46:23,553][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:46:23,880][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:46:24,207][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:46:24,533][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:46:24,860][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:46:25,187][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:46:25,512][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:46:25,838][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:46:26,165][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:46:26,492][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:46:26,820][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:46:27,151][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:46:27,478][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:46:27,804][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:46:28,129][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:46:28,454][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:46:28,780][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:46:29,106][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:46:29,433][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:46:29,761][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:46:30,086][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:46:30,417][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:46:30,748][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:46:31,074][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:46:31,405][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:46:31,732][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:46:32,058][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:46:32,386][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:46:32,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:46:33,432][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:46:34,166][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:46:34,167][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:46:34,169][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:46:35,116][__main__][INFO] - Iteration 123 took 21s (34.01% Gen, 61.55% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 6m 28s. Estimated total time: 17h 49m 14s. Time estimates for 10 more iterations: 3m 33s, 100 more iterations: 35m 38s, 500 more iterations: 2h 58m 12s.
[2025-11-13 08:46:35,119][__main__][INFO] - Starting iteration 123.
[2025-11-13 08:46:35,122][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:46:35,122][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:46:42,312][__main__][INFO] - Number of regex retries in iteration 123: 0
[2025-11-13 08:46:42,312][__main__][INFO] - agents played in iteration 123 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:46:42,769][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:42,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:42,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:42,869][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:42,870][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:46:42,871][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:46:43,591][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:46:43,887][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:46:44,214][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:46:44,544][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:46:44,869][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:46:45,194][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:46:45,522][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:46:45,849][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:46:46,176][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:46:46,503][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:46:46,829][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:46:47,160][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:46:47,486][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:46:47,811][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:46:48,137][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:46:48,468][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:46:48,796][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:46:49,129][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:46:49,457][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:46:49,783][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:46:50,108][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:46:50,438][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:46:50,764][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:46:51,090][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:46:51,414][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:46:51,740][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:46:52,066][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:46:52,397][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:46:52,726][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:46:53,052][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:46:53,377][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:46:53,706][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:46:54,031][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:46:54,739][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:46:55,464][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:46:55,465][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:46:55,467][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:46:56,437][__main__][INFO] - Iteration 124 took 21s (33.73% Gen, 61.71% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 2m 40s. Estimated total time: 17h 45m 48s. Time estimates for 10 more iterations: 3m 33s, 100 more iterations: 35m 31s, 500 more iterations: 2h 57m 38s.
[2025-11-13 08:46:56,439][__main__][INFO] - Starting iteration 124.
[2025-11-13 08:46:56,442][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:46:56,442][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:47:03,831][__main__][INFO] - Number of regex retries in iteration 124: 0
[2025-11-13 08:47:03,831][__main__][INFO] - agents played in iteration 124 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:47:04,283][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:04,316][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:04,349][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:04,382][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:04,383][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:47:04,383][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:47:05,096][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:47:05,392][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:47:05,719][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:47:06,045][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:47:06,374][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:47:06,700][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:47:07,026][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:47:07,351][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:47:07,678][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:47:08,003][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:47:08,328][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:47:08,653][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:47:08,978][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:47:09,305][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:47:09,636][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:47:09,961][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:47:10,286][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:47:10,613][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:47:10,938][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:47:11,264][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:47:11,589][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:47:11,915][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:47:12,240][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:47:12,566][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:47:12,893][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:47:13,218][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:47:13,543][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:47:13,868][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:47:14,194][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:47:14,519][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:47:14,846][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:47:15,172][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:47:15,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:47:16,211][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:47:16,938][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:47:16,939][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:47:16,941][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:47:17,942][__main__][INFO] - Iteration 125 took 21s (34.37% Gen, 60.98% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 11m 33s. Estimated total time: 17h 55m 2s. Time estimates for 10 more iterations: 3m 35s, 100 more iterations: 35m 50s, 500 more iterations: 2h 59m 10s.
[2025-11-13 08:47:17,945][__main__][INFO] - Starting iteration 125.
[2025-11-13 08:47:17,948][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:47:17,948][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:47:24,782][__main__][INFO] - Number of regex retries in iteration 125: 0
[2025-11-13 08:47:24,783][__main__][INFO] - agents played in iteration 125 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:47:25,240][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:25,276][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:25,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:25,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:25,345][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:47:25,345][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:47:26,066][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:47:26,364][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:47:26,692][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:47:27,017][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:47:27,342][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:47:27,668][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:47:27,993][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:47:28,326][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:47:28,652][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:47:28,981][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:47:29,306][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:47:29,637][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:47:29,968][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:47:30,298][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:47:30,626][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:47:30,956][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:47:31,288][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:47:31,618][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:47:31,951][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:47:32,275][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:47:32,601][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:47:32,926][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:47:33,254][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:47:33,581][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:47:33,906][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:47:34,235][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:47:34,564][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:47:34,891][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:47:35,221][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:47:35,546][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:47:35,873][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:47:36,201][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:47:36,529][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:47:37,235][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:47:37,975][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:47:37,976][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:47:37,978][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:47:38,952][__main__][INFO] - Iteration 126 took 21s (32.53% Gen, 62.82% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 46m 26s. Estimated total time: 17h 30m 16s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 0s, 500 more iterations: 2h 55m 2s.
[2025-11-13 08:47:38,954][__main__][INFO] - Starting iteration 126.
[2025-11-13 08:47:38,958][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:47:38,958][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:47:46,205][__main__][INFO] - Number of regex retries in iteration 126: 0
[2025-11-13 08:47:46,205][__main__][INFO] - agents played in iteration 126 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:47:46,662][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:46,697][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:46,732][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:46,766][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:46,767][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:47:46,767][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:47:47,487][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:47:47,784][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:47:48,110][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:47:48,436][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:47:48,760][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:47:49,085][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:47:49,411][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:47:49,736][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:47:50,063][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:47:50,389][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:47:50,713][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:47:51,038][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:47:51,365][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:47:51,690][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:47:52,016][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:47:52,341][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:47:52,666][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:47:52,991][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:47:53,315][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:47:53,641][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:47:53,967][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:47:54,294][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:47:54,619][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:47:54,944][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:47:55,270][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:47:55,595][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:47:55,922][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:47:56,248][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:47:56,574][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:47:56,899][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:47:57,227][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:47:57,552][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:47:57,878][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
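[editor's note] The "Accumulated the policy gradient loss for 3840 tokens" entries indicate that the per-mini-batch losses are summed with token-count weighting before a single "Apply reinforce step". A minimal, framework-free sketch of that token-weighted REINFORCE accumulation follows; the function name and data layout are illustrative, not taken from the mllm codebase, and a real trainer would backpropagate each mini-batch loss as it is computed:

```python
def accumulate_policy_loss(minibatches):
    """Token-weighted REINFORCE loss accumulated over mini-batches.

    `minibatches` is an iterable of (logprobs, advantages) pairs, where
    each element holds one value per action token. Returns the mean
    per-token loss and the total token count (the "3840 tokens" figure
    in the log would correspond to the latter).
    """
    total_loss = 0.0
    total_tokens = 0
    for logprobs, advantages in minibatches:
        # REINFORCE objective: minimize -log pi(a|s) * advantage per token.
        total_loss += sum(-lp * adv for lp, adv in zip(logprobs, advantages))
        total_tokens += len(logprobs)
    return total_loss / max(total_tokens, 1), total_tokens
```

Dividing once by the total token count (rather than averaging per mini-batch) keeps the gradient scale independent of how the 128 mini-batches partition the tokens.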
[2025-11-13 08:47:58,595][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:47:59,343][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:47:59,344][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:47:59,346][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:48:00,345][__main__][INFO] - Iteration 127 took 21s (33.88% Gen, 61.44% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 5m 12s. Estimated total time: 17h 49m 24s. Time estimates for 10 more iterations: 3m 33s, 100 more iterations: 35m 38s, 500 more iterations: 2h 58m 14s.
[2025-11-13 08:48:00,347][__main__][INFO] - Starting iteration 127.
[2025-11-13 08:48:00,350][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:48:00,351][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:48:07,569][__main__][INFO] - Number of regex retries in iteration 127: 0
[2025-11-13 08:48:07,570][__main__][INFO] - agents played in iteration 127 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:48:08,028][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:08,061][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:08,094][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:08,128][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:08,128][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:48:08,129][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:48:08,844][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:48:09,142][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:48:09,469][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:48:09,797][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:48:10,127][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:48:10,451][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:48:10,781][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:48:11,109][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:48:11,437][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:48:11,762][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:48:12,087][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:48:12,412][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:48:12,737][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:48:13,065][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:48:13,390][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:48:13,715][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:48:14,041][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:48:14,366][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:48:14,692][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:48:15,019][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:48:15,345][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:48:15,669][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:48:15,997][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:48:16,323][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:48:16,648][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:48:16,974][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:48:17,299][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:48:17,625][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:48:17,951][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:48:18,278][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:48:18,604][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:48:18,931][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:48:19,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:48:19,965][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:48:20,701][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:48:20,702][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:48:20,704][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:48:21,668][__main__][INFO] - Iteration 128 took 21s (33.86% Gen, 61.61% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 1m 23s. Estimated total time: 17h 45m 56s. Time estimates for 10 more iterations: 3m 33s, 100 more iterations: 35m 31s, 500 more iterations: 2h 57m 39s.
[2025-11-13 08:48:21,670][__main__][INFO] - Starting iteration 128.
[2025-11-13 08:48:21,673][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:48:21,674][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:48:29,199][__main__][INFO] - Number of regex retries in iteration 128: 0
[2025-11-13 08:48:29,200][__main__][INFO] - agents played in iteration 128 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:48:29,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:29,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:29,734][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:29,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:29,768][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:48:29,769][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:48:30,495][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:48:30,791][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:48:31,117][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:48:31,442][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:48:31,768][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:48:32,093][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:48:32,420][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:48:32,745][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:48:33,072][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:48:33,399][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:48:33,726][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:48:34,050][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:48:34,375][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:48:34,700][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:48:35,027][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:48:35,355][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:48:35,682][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:48:36,010][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:48:36,336][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:48:36,665][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:48:36,992][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:48:37,318][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:48:37,648][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:48:37,973][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:48:38,299][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:48:38,627][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:48:38,953][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:48:39,279][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:48:39,604][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:48:39,935][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:48:40,266][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:48:40,592][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:48:40,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
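[editor's note] The per-iteration summary entries ("Iteration N took 21s ... Estimated remaining time ... Time estimates for 10/100/500 more iterations") can all be derived from a running mean of completed-iteration durations. A minimal sketch of that estimator follows; the function names are illustrative, and the trainer's actual estimator may differ (e.g. an exponential moving average):

```python
def eta_seconds(iter_times, iters_remaining):
    """Estimate remaining wall time from completed-iteration durations.

    `iter_times` holds the durations (seconds) of finished iterations;
    the same mean yields the "10/100/500 more iterations" projections.
    """
    mean = sum(iter_times) / len(iter_times)
    return mean * iters_remaining

def fmt_hms(seconds):
    """Render seconds in the log's "17h 5m 12s" / "3m 30s" style."""
    s = int(round(seconds))
    h, s = divmod(s, 3600)
    m, s = divmod(s, 60)
    return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"
```

For example, at a 21 s mean iteration time, 10 more iterations project to 210 s, which `fmt_hms` renders as "3m 30s", matching the figures logged above.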
[2025-11-13 08:48:41,636][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:48:42,376][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:48:42,377][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:48:42,379][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:48:43,330][__main__][INFO] - Iteration 129 took 21s (34.75% Gen, 60.85% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 17m 58s. Estimated total time: 18h 2m 52s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 5s, 500 more iterations: 3h 0m 28s.
[2025-11-13 08:48:43,332][__main__][INFO] - Starting iteration 129.
[2025-11-13 08:48:43,335][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:48:43,335][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:48:50,530][__main__][INFO] - Number of regex retries in iteration 129: 0
[2025-11-13 08:48:50,531][__main__][INFO] - agents played in iteration 129 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:48:50,989][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:51,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:51,058][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:51,092][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:51,093][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:48:51,093][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:48:51,807][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:48:52,103][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:48:52,431][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:48:52,759][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:48:53,085][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:48:53,414][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:48:53,741][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:48:54,065][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:48:54,391][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:48:54,718][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:48:55,045][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:48:55,372][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:48:55,698][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:48:56,023][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:48:56,349][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:48:56,674][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:48:57,003][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:48:57,333][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:48:57,661][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:48:57,989][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:48:58,315][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:48:58,640][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:48:58,967][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:48:59,293][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:48:59,621][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:48:59,946][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:49:00,273][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:49:00,598][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:49:00,923][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:49:01,250][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:49:01,576][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:49:01,902][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:49:02,229][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:49:02,947][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:49:03,672][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:49:03,673][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:49:03,675][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:49:04,698][__main__][INFO] - Iteration 130 took 21s (33.68% Gen, 61.53% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 2m 54s. Estimated total time: 17h 48m 10s. Time estimates for 10 more iterations: 3m 33s, 100 more iterations: 35m 36s, 500 more iterations: 2h 58m 1s.
[2025-11-13 08:49:04,699][__main__][INFO] - Starting iteration 130.
[2025-11-13 08:49:04,702][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:49:04,702][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:49:11,797][__main__][INFO] - Number of regex retries in iteration 130: 0
[2025-11-13 08:49:11,798][__main__][INFO] - agents played in iteration 130 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:49:12,254][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:12,287][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:12,321][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:12,355][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:12,355][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:49:12,356][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:49:13,066][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:49:13,363][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:49:13,693][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:49:14,020][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:49:14,346][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:49:14,676][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:49:15,005][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:49:15,332][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:49:15,659][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:49:15,984][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:49:16,309][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:49:16,638][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:49:16,964][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:49:17,289][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:49:17,615][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:49:17,942][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:49:18,270][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:49:18,599][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:49:18,924][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:49:19,252][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:49:19,577][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:49:19,904][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:49:20,230][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:49:20,554][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:49:20,881][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:49:21,206][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:49:21,532][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:49:21,863][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:49:22,190][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:49:22,516][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:49:22,842][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:49:23,173][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:49:23,500][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:49:24,213][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:49:24,944][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:49:24,945][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:49:24,947][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:49:26,811][__main__][INFO] - Iteration 131 took 22s (32.09% Gen, 59.47% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 39m 51s. Estimated total time: 18h 25m 28s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 50s, 500 more iterations: 3h 4m 14s.
[2025-11-13 08:49:26,813][__main__][INFO] - Starting iteration 131.
[2025-11-13 08:49:26,815][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:49:26,816][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:49:34,649][__main__][INFO] - Number of regex retries in iteration 131: 0
[2025-11-13 08:49:34,649][__main__][INFO] - agents played in iteration 131 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:49:35,106][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:35,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:35,173][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:35,206][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:35,206][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:49:35,207][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:49:35,934][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:49:36,230][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:49:36,558][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:49:36,886][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:49:37,210][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:49:37,536][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:49:37,863][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:49:38,189][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:49:38,516][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:49:38,841][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:49:39,166][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:49:39,492][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:49:39,817][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:49:40,146][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:49:40,473][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:49:40,798][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:49:41,127][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:49:41,454][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:49:41,780][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:49:42,111][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:49:42,441][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:49:42,770][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:49:43,099][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:49:43,429][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:49:43,758][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:49:44,083][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:49:44,410][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:49:44,735][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:49:45,061][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:49:45,387][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:49:45,714][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:49:46,046][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:49:46,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
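The pattern above (every 4th of 128 mini-batches logged, then one accumulated loss over 3840 tokens, i.e. 30 tokens per mini-batch) is standard gradient accumulation for a REINFORCE-style objective. A minimal pure-Python sketch of the control flow only; the function and data layout are hypothetical, and the real `mllm` trainer operates on tensors with `backward()` rather than plain floats:

```python
def accumulate_policy_loss(batches, log=print):
    """Average a REINFORCE-style loss (-advantage * log-prob) over mini-batches.

    `batches` is a list of mini-batches; each mini-batch is a list of
    (log_prob, advantage) pairs, one per token. Hypothetical structure,
    chosen only to illustrate the accumulation arithmetic in the log.
    """
    n = len(batches)
    total_loss, total_tokens = 0.0, 0
    for i, batch in enumerate(batches):
        if i % 4 == 0:                     # the trainer logs every 4th batch
            log(f"Processing mini-batch {i} of {n}")
        batch_loss = -sum(a * lp for lp, a in batch) / len(batch)
        total_loss += batch_loss / n       # scale so the sum is a batch average
        total_tokens += len(batch)
    log(f"Accumulated the policy gradient loss for {total_tokens} tokens.")
    return total_loss, total_tokens
```

With 128 mini-batches of 30 tokens each, `total_tokens` reproduces the 3840-token count in the log.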
[2025-11-13 08:49:47,113][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:49:47,860][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:49:47,861][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:49:47,863][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:49:48,838][__main__][INFO] - Iteration 132 took 22s (35.57% Gen, 60.00% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 35m 10s. Estimated total time: 18h 21m 10s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 42s, 500 more iterations: 3h 3m 31s.
[2025-11-13 08:49:48,840][__main__][INFO] - Starting iteration 132.
[2025-11-13 08:49:48,843][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
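The ETA figures in the iteration summary are consistent with simple arithmetic: an average seconds-per-iteration rate multiplied by a look-ahead horizon (about 22.022 s/iteration reproduces the 10/100/500-iteration estimates here). A minimal sketch under that assumption; the function names are illustrative, not the actual `mllm` helpers:

```python
def fmt_duration(seconds: float) -> str:
    """Render seconds as 'Hh Mm Ss', dropping the hours field when zero,
    matching the style of the log's time estimates."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

def estimate_times(avg_iter_seconds: float, horizons=(10, 100, 500)) -> dict:
    """Project wall-clock time for several look-ahead horizons (iterations)."""
    return {n: fmt_duration(avg_iter_seconds * n) for n in horizons}
```

For example, `estimate_times(22.022)` yields "3m 40s", "36m 42s", and "3h 3m 31s" for 10, 100, and 500 iterations, matching the summary above.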
[2025-11-13 08:49:48,844][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:49:56,423][__main__][INFO] - Number of regex retries in iteration 132: 0
[2025-11-13 08:49:56,424][__main__][INFO] - agents played in iteration 132 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:49:56,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:56,919][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:56,952][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:56,986][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:56,986][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:49:56,986][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:49:57,712][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:49:58,008][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:49:58,335][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:49:58,661][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:49:58,987][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:49:59,315][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:49:59,643][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:49:59,968][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:50:00,293][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:50:00,619][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:50:00,945][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:50:01,270][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:50:01,596][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:50:01,921][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:50:02,251][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:50:02,580][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:50:02,905][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:50:03,231][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:50:03,558][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:50:03,884][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:50:04,212][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:50:04,541][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:50:04,867][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:50:05,193][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:50:05,523][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:50:05,850][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:50:06,178][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:50:06,503][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:50:06,828][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:50:07,155][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:50:07,481][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:50:07,806][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:50:08,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:50:08,869][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:50:09,611][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:50:09,612][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:50:09,614][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:50:10,710][__main__][INFO] - Iteration 133 took 21s (34.66% Gen, 60.32% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 27m 1s. Estimated total time: 18h 13m 23s. Time estimates for 10 more iterations: 3m 38s, 100 more iterations: 36m 26s, 500 more iterations: 3h 2m 13s.
[2025-11-13 08:50:10,712][__main__][INFO] - Starting iteration 133.
[2025-11-13 08:50:10,714][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:50:10,715][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:50:18,209][__main__][INFO] - Number of regex retries in iteration 133: 0
[2025-11-13 08:50:18,209][__main__][INFO] - agents played in iteration 133 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:50:18,660][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:18,694][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:18,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:18,762][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:18,762][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:50:18,763][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:50:19,476][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:50:19,772][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:50:20,100][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:50:20,432][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:50:20,759][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:50:21,087][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:50:21,418][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:50:21,744][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:50:22,074][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:50:22,405][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:50:22,729][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:50:23,057][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:50:23,382][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:50:23,711][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:50:24,043][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:50:24,371][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:50:24,698][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:50:25,024][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:50:25,352][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:50:25,679][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:50:26,007][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:50:26,332][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:50:26,658][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:50:26,984][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:50:27,309][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:50:27,636][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:50:27,961][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:50:28,287][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:50:28,614][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:50:28,942][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:50:29,267][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:50:29,595][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:50:29,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:50:30,652][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:50:31,380][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:50:31,381][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:50:31,383][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:50:32,313][__main__][INFO] - Iteration 134 took 21s (34.70% Gen, 60.99% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 13m 15s. Estimated total time: 17h 59m 58s. Time estimates for 10 more iterations: 3m 35s, 100 more iterations: 35m 59s, 500 more iterations: 2h 59m 59s.
[2025-11-13 08:50:32,315][__main__][INFO] - Starting iteration 134.
[2025-11-13 08:50:32,318][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:50:32,318][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:50:39,919][__main__][INFO] - Number of regex retries in iteration 134: 0
[2025-11-13 08:50:39,920][__main__][INFO] - agents played in iteration 134 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:50:40,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:40,414][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:40,447][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:40,481][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:40,482][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:50:40,482][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:50:41,187][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:50:41,482][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:50:41,808][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:50:42,135][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:50:42,460][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:50:42,785][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:50:43,110][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:50:43,435][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:50:43,760][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:50:44,086][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:50:44,412][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:50:44,737][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:50:45,062][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:50:45,389][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:50:45,716][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:50:46,040][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:50:46,369][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:50:46,699][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:50:47,026][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:50:47,350][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:50:47,675][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:50:48,001][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:50:48,326][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:50:48,651][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:50:48,976][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:50:49,301][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:50:49,628][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:50:49,952][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:50:50,278][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:50:50,603][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:50:50,929][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:50:51,256][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:50:51,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:50:52,303][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:50:53,027][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:50:53,029][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:50:53,031][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:50:53,945][__main__][INFO] - Iteration 135 took 21s (35.14% Gen, 60.62% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 14m 18s. Estimated total time: 18h 1m 23s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 2s, 500 more iterations: 3h 0m 13s.
[2025-11-13 08:50:53,947][__main__][INFO] - Starting iteration 135.
[2025-11-13 08:50:53,949][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:50:53,950][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:51:01,285][__main__][INFO] - Number of regex retries in iteration 135: 0
[2025-11-13 08:51:01,286][__main__][INFO] - agents played in iteration 135 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:51:01,752][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:01,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:01,819][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:01,852][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:01,853][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:51:01,854][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:51:02,584][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:51:02,880][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:51:03,207][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:51:03,532][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:51:03,858][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:51:04,184][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:51:04,510][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:51:04,838][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:51:05,167][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:51:05,494][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:51:05,819][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:51:06,146][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:51:06,472][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:51:06,798][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:51:07,127][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:51:07,452][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:51:07,777][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:51:08,103][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:51:08,435][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:51:08,761][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:51:09,088][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:51:09,414][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:51:09,740][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:51:10,069][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:51:10,397][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:51:10,722][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:51:11,052][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:51:11,378][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:51:11,704][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:51:12,032][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:51:12,363][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:51:12,692][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:51:13,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:51:13,759][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:51:14,502][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:51:14,503][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:51:14,505][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:51:15,423][__main__][INFO] - Iteration 136 took 21s (34.16% Gen, 61.56% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 6m 17s. Estimated total time: 17h 53m 44s. Time estimates for 10 more iterations: 3m 34s, 100 more iterations: 35m 47s, 500 more iterations: 2h 58m 57s.
[2025-11-13 08:51:15,425][__main__][INFO] - Starting iteration 136.
[2025-11-13 08:51:15,428][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:51:15,428][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:51:23,086][__main__][INFO] - Number of regex retries in iteration 136: 0
[2025-11-13 08:51:23,087][__main__][INFO] - agents played in iteration 136 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:51:23,544][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:23,580][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:23,614][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:23,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:23,649][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:51:23,649][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:51:24,353][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:51:24,650][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:51:24,981][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:51:25,309][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:51:25,637][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:51:25,966][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:51:26,292][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:51:26,618][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:51:26,952][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:51:27,280][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:51:27,610][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:51:27,935][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:51:28,262][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:51:28,588][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:51:28,914][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:51:29,240][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:51:29,570][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:51:29,896][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:51:30,226][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:51:30,552][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:51:30,880][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:51:31,206][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:51:31,531][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:51:31,856][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:51:32,183][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:51:32,507][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:51:32,832][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:51:33,158][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:51:33,484][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:51:33,812][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:51:34,137][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:51:34,465][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:51:34,792][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:51:35,510][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:51:36,237][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:51:36,239][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:51:36,240][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:51:37,156][__main__][INFO] - Iteration 137 took 21s (35.25% Gen, 60.53% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 18m 39s. Estimated total time: 18h 6m 27s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 12s, 500 more iterations: 3h 1m 4s.
[2025-11-13 08:51:37,158][__main__][INFO] - Starting iteration 137.
[2025-11-13 08:51:37,161][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:51:37,162][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:51:44,370][__main__][INFO] - Number of regex retries in iteration 137: 0
[2025-11-13 08:51:44,371][__main__][INFO] - agents played in iteration 137 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:51:44,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:44,856][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:44,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:44,922][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:44,923][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:51:44,923][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:51:45,647][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:51:45,943][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:51:46,271][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:51:46,596][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:51:46,922][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:51:47,246][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:51:47,573][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:51:47,898][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:51:48,224][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:51:48,551][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:51:48,880][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:51:49,206][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:51:49,534][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:51:49,859][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:51:50,186][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:51:50,511][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:51:50,837][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:51:51,164][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:51:51,491][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:51:51,816][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:51:52,142][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:51:52,467][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:51:52,794][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:51:53,119][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:51:53,445][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:51:53,771][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:51:54,098][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:51:54,423][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:51:54,749][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:51:55,075][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:51:55,400][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:51:55,726][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:51:56,053][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:51:56,769][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:51:57,501][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:51:57,502][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:51:57,504][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:51:58,470][__main__][INFO] - Iteration 138 took 21s (33.83% Gen, 61.63% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 57m 21s. Estimated total time: 17h 45m 30s. Time estimates for 10 more iterations: 3m 33s, 100 more iterations: 35m 31s, 500 more iterations: 2h 57m 35s.
[2025-11-13 08:51:58,472][__main__][INFO] - Starting iteration 138.
[2025-11-13 08:51:58,475][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:51:58,476][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:52:05,908][__main__][INFO] - Number of regex retries in iteration 138: 0
[2025-11-13 08:52:05,909][__main__][INFO] - agents played in iteration 138 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:52:06,375][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:06,409][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:06,443][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:06,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:06,477][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:52:06,478][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:52:07,209][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:52:07,504][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:52:07,831][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:52:08,159][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:52:08,484][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:52:08,811][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:52:09,136][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:52:09,463][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:52:09,793][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:52:10,122][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:52:10,448][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:52:10,777][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:52:11,107][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:52:11,433][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:52:11,760][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:52:12,091][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:52:12,416][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:52:12,741][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:52:13,069][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:52:13,396][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:52:13,726][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:52:14,053][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:52:14,378][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:52:14,704][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:52:15,031][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:52:15,360][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:52:15,688][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:52:16,013][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:52:16,338][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:52:16,665][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:52:16,992][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:52:17,316][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:52:17,647][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:52:18,366][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:52:19,099][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:52:19,101][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:52:19,102][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:52:20,070][__main__][INFO] - Iteration 139 took 21s (34.42% Gen, 61.09% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 11m 14s. Estimated total time: 17h 59m 45s. Time estimates for 10 more iterations: 3m 35s, 100 more iterations: 35m 59s, 500 more iterations: 2h 59m 57s.
[2025-11-13 08:52:20,072][__main__][INFO] - Starting iteration 139.
[2025-11-13 08:52:20,075][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:52:20,075][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:52:27,782][__main__][INFO] - Number of regex retries in iteration 139: 0
[2025-11-13 08:52:27,782][__main__][INFO] - agents played in iteration 139 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:52:28,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:28,270][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:28,304][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:28,337][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:28,338][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:52:28,338][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:52:29,046][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:52:29,340][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:52:29,667][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:52:29,991][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:52:30,318][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:52:30,645][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:52:30,970][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:52:31,296][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:52:31,621][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:52:31,948][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:52:32,274][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:52:32,601][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:52:32,927][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:52:33,252][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:52:33,582][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:52:33,909][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:52:34,234][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:52:34,565][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:52:34,893][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:52:35,220][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:52:35,549][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:52:35,877][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:52:36,207][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:52:36,533][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:52:36,860][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:52:37,187][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:52:37,512][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:52:37,838][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:52:38,164][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:52:38,490][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:52:38,816][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:52:39,141][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:52:39,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:52:40,179][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:52:40,910][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:52:40,912][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:52:40,913][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:52:41,864][__main__][INFO] - Iteration 140 took 21s (35.36% Gen, 60.26% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 20m 37s. Estimated total time: 18h 9m 30s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 19s, 500 more iterations: 3h 1m 35s.
[2025-11-13 08:52:41,866][__main__][INFO] - Starting iteration 140.
[2025-11-13 08:52:41,870][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:52:41,870][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:52:49,430][__main__][INFO] - Number of regex retries in iteration 140: 0
[2025-11-13 08:52:49,430][__main__][INFO] - agents played in iteration 140 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:52:49,885][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:49,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:49,951][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:49,984][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:49,985][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:52:49,986][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:52:50,699][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:52:50,996][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:52:51,322][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:52:51,653][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:52:51,984][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:52:52,313][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:52:52,640][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:52:52,968][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:52:53,293][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:52:53,620][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:52:53,946][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:52:54,273][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:52:54,601][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:52:54,928][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:52:55,259][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:52:55,585][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:52:55,914][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:52:56,241][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:52:56,569][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:52:56,894][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:52:57,222][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:52:57,548][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:52:57,878][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:52:58,203][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:52:58,531][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:52:58,860][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:52:59,189][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:52:59,517][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:52:59,843][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:53:00,168][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:53:00,496][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:53:00,824][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:53:01,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:53:01,875][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:53:02,610][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:53:02,612][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:53:02,613][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:53:04,456][__main__][INFO] - Iteration 141 took 22s (33.47% Gen, 58.36% Train). Generation: 7s, Training: 13s. Estimated remaining time: 18h 0m 7s. Estimated total time: 18h 49m 22s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 38s, 500 more iterations: 3h 8m 13s.
[2025-11-13 08:53:04,458][__main__][INFO] - Starting iteration 141.
[2025-11-13 08:53:04,461][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:53:04,461][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:53:12,063][__main__][INFO] - Number of regex retries in iteration 141: 0
[2025-11-13 08:53:12,064][__main__][INFO] - agents played in iteration 141 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:53:12,520][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:12,555][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:12,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:12,621][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:12,622][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:53:12,622][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:53:13,340][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:53:13,637][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:53:13,966][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:53:14,293][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:53:14,619][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:53:14,944][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:53:15,270][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:53:15,594][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:53:15,921][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:53:16,247][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:53:16,572][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:53:16,898][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:53:17,224][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:53:17,552][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:53:17,879][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:53:18,204][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:53:18,531][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:53:18,856][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:53:19,181][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:53:19,507][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:53:19,831][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:53:20,158][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:53:20,484][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:53:20,810][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:53:21,134][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:53:21,461][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:53:21,788][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:53:22,113][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:53:22,438][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:53:22,768][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:53:23,093][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:53:23,419][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:53:23,747][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:53:24,459][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:53:25,186][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:53:25,188][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:53:25,190][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:53:26,174][__main__][INFO] - Iteration 142 took 21s (35.01% Gen, 60.45% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 16m 5s. Estimated total time: 18h 5m 42s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 11s, 500 more iterations: 3h 0m 57s.
[2025-11-13 08:53:26,176][__main__][INFO] - Starting iteration 142.
[2025-11-13 08:53:26,180][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:53:26,180][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:53:34,124][__main__][INFO] - Number of regex retries in iteration 142: 0
[2025-11-13 08:53:34,124][__main__][INFO] - agents played in iteration 142 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:53:34,582][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:34,616][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:34,650][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:34,684][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:34,684][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:53:34,684][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:53:35,400][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:53:35,697][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:53:36,026][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:53:36,353][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:53:36,678][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:53:37,003][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:53:37,328][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:53:37,655][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:53:37,981][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:53:38,307][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:53:38,634][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:53:38,960][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:53:39,287][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:53:39,614][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:53:39,941][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:53:40,266][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:53:40,592][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:53:40,918][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:53:41,245][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:53:41,570][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:53:41,897][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:53:42,225][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:53:42,555][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:53:42,880][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:53:43,206][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:53:43,532][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:53:43,857][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:53:44,182][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:53:44,509][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:53:44,834][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:53:45,160][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:53:45,488][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:53:45,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:53:46,533][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:53:47,270][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:53:47,272][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:53:47,273][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:53:48,197][__main__][INFO] - Iteration 143 took 22s (36.08% Gen, 59.72% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 30m 54s. Estimated total time: 18h 20m 53s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 41s, 500 more iterations: 3h 3m 28s.
[2025-11-13 08:53:48,199][__main__][INFO] - Starting iteration 143.
[2025-11-13 08:53:48,202][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:53:48,202][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:53:55,999][__main__][INFO] - Number of regex retries in iteration 143: 0
[2025-11-13 08:53:56,000][__main__][INFO] - agents played in iteration 143 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:53:56,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:56,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:56,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:56,557][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:56,557][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:53:56,558][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:53:57,272][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:53:57,568][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:53:57,893][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:53:58,218][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:53:58,544][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:53:58,869][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:53:59,194][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:53:59,522][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:53:59,849][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:54:00,174][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:54:00,502][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:54:00,827][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:54:01,153][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:54:01,487][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:54:01,812][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:54:02,139][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:54:02,465][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:54:02,791][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:54:03,116][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:54:03,442][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:54:03,772][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:54:04,100][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:54:04,426][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:54:04,751][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:54:05,078][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:54:05,405][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:54:05,730][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:54:06,058][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:54:06,384][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:54:06,711][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:54:07,036][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:54:07,362][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:54:07,693][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:54:08,408][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:54:09,133][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:54:09,134][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:54:09,136][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:54:10,086][__main__][INFO] - Iteration 144 took 21s (35.63% Gen, 60.03% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 23m 55s. Estimated total time: 18h 14m 16s. Time estimates for 10 more iterations: 3m 38s, 100 more iterations: 36m 28s, 500 more iterations: 3h 2m 22s.
[2025-11-13 08:54:10,088][__main__][INFO] - Starting iteration 144.
[2025-11-13 08:54:10,091][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:54:10,091][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:54:17,946][__main__][INFO] - Number of regex retries in iteration 144: 0
[2025-11-13 08:54:17,947][__main__][INFO] - agents played in iteration 144 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:54:18,401][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:18,438][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:18,472][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:18,506][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:18,507][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:54:18,507][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:54:19,206][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:54:19,503][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:54:19,829][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:54:20,154][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:54:20,479][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:54:20,804][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:54:21,130][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:54:21,455][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:54:21,781][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:54:22,106][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:54:22,432][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:54:22,756][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:54:23,083][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:54:23,408][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:54:23,733][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:54:24,060][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:54:24,387][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:54:24,712][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:54:25,037][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:54:25,364][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:54:25,689][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:54:26,013][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:54:26,337][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:54:26,664][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:54:26,989][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:54:27,316][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:54:27,642][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:54:27,969][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:54:28,295][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:54:28,621][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:54:28,946][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:54:29,278][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:54:29,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:54:30,325][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:54:31,076][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:54:31,077][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:54:31,079][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:54:32,055][__main__][INFO] - Iteration 145 took 21s (35.76% Gen, 59.79% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 27m 30s. Estimated total time: 18h 18m 13s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 36s, 500 more iterations: 3h 3m 2s.
[2025-11-13 08:54:32,056][__main__][INFO] - Starting iteration 145.
[2025-11-13 08:54:32,059][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:54:32,060][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:54:39,648][__main__][INFO] - Number of regex retries in iteration 145: 0
[2025-11-13 08:54:39,648][__main__][INFO] - agents played in iteration 145 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:54:40,169][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:40,203][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:40,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:40,271][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:40,272][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:54:40,272][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:54:41,004][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:54:41,300][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:54:41,626][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:54:41,952][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:54:42,279][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:54:42,608][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:54:42,935][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:54:43,262][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:54:43,590][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:54:43,917][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:54:44,248][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:54:44,575][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:54:44,902][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:54:45,229][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:54:45,555][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:54:45,881][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:54:46,207][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:54:46,533][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:54:46,860][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:54:47,186][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:54:47,514][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:54:47,840][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:54:48,166][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:54:48,491][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:54:48,818][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:54:49,144][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:54:49,469][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:54:49,795][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:54:50,120][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:54:50,447][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:54:50,774][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:54:51,100][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:54:51,428][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:54:52,151][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:54:52,889][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:54:52,890][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:54:52,892][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:54:53,818][__main__][INFO] - Iteration 146 took 21s (34.87% Gen, 60.86% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 16m 55s. Estimated total time: 18h 8m 0s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 16s, 500 more iterations: 3h 1m 20s.
[2025-11-13 08:54:53,820][__main__][INFO] - Starting iteration 146.
[2025-11-13 08:54:53,823][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:54:53,823][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:55:01,503][__main__][INFO] - Number of regex retries in iteration 146: 0
[2025-11-13 08:55:01,503][__main__][INFO] - agents played in iteration 146 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:55:01,960][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:01,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:02,027][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:02,062][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:02,063][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:55:02,063][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:55:02,794][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:55:03,091][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:55:03,417][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:55:03,745][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:55:04,072][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:55:04,400][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:55:04,727][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:55:05,051][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:55:05,378][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:55:05,704][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:55:06,033][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:55:06,358][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:55:06,685][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:55:07,014][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:55:07,341][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:55:07,670][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:55:07,999][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:55:08,327][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:55:08,653][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:55:08,981][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:55:09,308][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:55:09,634][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:55:09,960][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:55:10,290][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:55:10,618][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:55:10,946][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:55:11,273][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:55:11,598][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:55:11,924][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:55:12,250][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:55:12,575][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:55:12,901][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:55:13,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:55:13,943][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:55:14,684][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:55:14,686][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:55:14,688][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:55:15,647][__main__][INFO] - Iteration 147 took 21s (35.19% Gen, 60.41% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 19m 47s. Estimated total time: 18h 11m 14s. Time estimates for 10 more iterations: 3m 38s, 100 more iterations: 36m 22s, 500 more iterations: 3h 1m 52s.
[2025-11-13 08:55:15,650][__main__][INFO] - Starting iteration 147.
[2025-11-13 08:55:15,653][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:55:15,653][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:55:23,494][__main__][INFO] - Number of regex retries in iteration 147: 0
[2025-11-13 08:55:23,495][__main__][INFO] - agents played in iteration 147 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:55:23,955][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:23,990][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:24,023][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:24,056][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:24,057][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:55:24,057][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:55:24,785][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:55:25,081][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:55:25,407][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:55:25,735][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:55:26,064][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:55:26,393][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:55:26,718][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:55:27,047][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:55:27,374][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:55:27,699][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:55:28,025][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:55:28,350][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:55:28,677][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:55:29,002][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:55:29,329][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:55:29,653][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:55:29,979][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:55:30,305][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:55:30,630][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:55:30,956][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:55:31,281][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:55:31,606][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:55:31,935][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:55:32,262][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:55:32,590][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:55:32,917][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:55:33,242][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:55:33,568][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:55:33,899][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:55:34,228][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:55:34,553][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:55:34,879][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:55:35,206][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:55:35,950][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:55:36,705][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:55:36,706][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:55:36,708][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:55:37,684][__main__][INFO] - Iteration 148 took 22s (35.59% Gen, 59.97% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 29m 48s. Estimated total time: 18h 21m 36s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 43s, 500 more iterations: 3h 3m 36s.
[2025-11-13 08:55:37,686][__main__][INFO] - Starting iteration 148.
[2025-11-13 08:55:37,689][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:55:37,689][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:55:45,551][__main__][INFO] - Number of regex retries in iteration 148: 0
[2025-11-13 08:55:45,551][__main__][INFO] - agents played in iteration 148 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:55:46,010][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:46,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:46,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:46,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:46,113][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:55:46,114][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:55:46,817][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:55:47,113][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:55:47,440][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:55:47,767][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:55:48,092][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:55:48,418][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:55:48,748][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:55:49,080][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:55:49,406][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:55:49,732][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:55:50,063][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:55:50,390][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:55:50,716][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:55:51,044][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:55:51,370][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:55:51,697][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:55:52,028][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:55:52,353][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:55:52,678][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:55:53,004][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:55:53,330][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:55:53,655][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:55:53,980][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:55:54,306][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:55:54,634][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:55:54,966][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:55:55,294][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:55:55,619][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:55:55,946][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:55:56,271][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:55:56,598][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:55:56,924][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:55:57,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:55:57,968][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:55:58,687][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:55:58,688][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:55:58,690][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:55:59,614][__main__][INFO] - Iteration 149 took 21s (35.86% Gen, 59.92% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 24m 9s. Estimated total time: 18h 16m 19s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 32s, 500 more iterations: 3h 2m 43s.
[2025-11-13 08:55:59,617][__main__][INFO] - Starting iteration 149.
[2025-11-13 08:55:59,620][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:55:59,620][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:56:07,290][__main__][INFO] - Number of regex retries in iteration 149: 0
[2025-11-13 08:56:07,290][__main__][INFO] - agents played in iteration 149 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:56:07,745][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:07,779][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:07,812][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:07,846][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:07,846][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:56:07,847][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:56:08,573][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:56:08,870][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:56:09,196][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:56:09,522][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:56:09,848][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:56:10,175][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:56:10,501][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:56:10,827][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:56:11,152][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:56:11,478][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:56:11,803][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:56:12,127][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:56:12,453][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:56:12,779][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:56:13,105][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:56:13,433][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:56:13,758][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:56:14,083][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:56:14,408][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:56:14,735][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:56:15,060][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:56:15,388][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:56:15,713][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:56:16,037][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:56:16,365][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:56:16,692][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:56:17,023][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:56:17,350][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:56:17,677][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:56:18,004][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:56:18,331][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:56:18,657][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:56:18,983][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:56:19,706][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:56:20,441][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:56:20,442][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:56:20,444][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:56:21,395][__main__][INFO] - Iteration 150 took 21s (35.22% Gen, 60.40% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 16m 17s. Estimated total time: 18h 8m 49s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 17s, 500 more iterations: 3h 1m 28s.
[2025-11-13 08:56:21,397][__main__][INFO] - Starting iteration 150.
[2025-11-13 08:56:21,401][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:56:21,401][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:56:29,252][__main__][INFO] - Number of regex retries in iteration 150: 0
[2025-11-13 08:56:29,252][__main__][INFO] - agents played in iteration 150 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:56:29,714][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:29,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:29,781][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:29,814][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:29,815][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:56:29,815][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:56:30,526][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:56:30,821][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:56:31,150][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:56:31,480][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:56:31,807][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:56:32,132][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:56:32,457][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:56:32,782][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:56:33,112][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:56:33,439][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:56:33,767][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:56:34,097][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:56:34,422][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:56:34,752][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:56:35,077][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:56:35,405][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:56:35,733][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:56:36,060][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:56:36,387][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:56:36,717][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:56:37,043][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:56:37,375][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:56:37,699][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:56:38,027][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:56:38,354][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:56:38,681][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:56:39,009][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:56:39,336][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:56:39,664][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:56:39,993][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:56:40,321][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:56:40,648][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:56:40,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:56:41,701][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:56:42,445][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:56:42,447][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:56:42,448][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:56:44,276][__main__][INFO] - Iteration 151 took 22s (34.32% Gen, 57.69% Train). Generation: 7s, Training: 13s. Estimated remaining time: 18h 10m 53s. Estimated total time: 19h 3m 48s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 7s, 500 more iterations: 3h 10m 38s.
[2025-11-13 08:56:44,278][__main__][INFO] - Starting iteration 151.
[2025-11-13 08:56:44,280][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:56:44,281][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:56:52,570][__main__][INFO] - Number of regex retries in iteration 151: 0
[2025-11-13 08:56:52,570][__main__][INFO] - agents played in iteration 151 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:56:53,035][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:53,071][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:53,105][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:53,139][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:53,140][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:56:53,140][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:56:53,855][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:56:54,151][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:56:54,479][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:56:54,806][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:56:55,132][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:56:55,458][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:56:55,785][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:56:56,109][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:56:56,435][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:56:56,759][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:56:57,084][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:56:57,410][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:56:57,736][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:56:58,062][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:56:58,392][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:56:58,719][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:56:59,050][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:56:59,376][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:56:59,706][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:57:00,033][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:57:00,360][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:57:00,687][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:57:01,012][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:57:01,339][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:57:01,666][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:57:01,991][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:57:02,321][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:57:02,648][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:57:02,978][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:57:03,308][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:57:03,635][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:57:03,963][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:57:04,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:57:05,007][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:57:05,733][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:57:05,734][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:57:05,736][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:57:06,649][__main__][INFO] - Iteration 152 took 22s (37.06% Gen, 58.85% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 45m 11s. Estimated total time: 18h 38m 29s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 16s, 500 more iterations: 3h 6m 24s.
[2025-11-13 08:57:06,651][__main__][INFO] - Starting iteration 152.
[2025-11-13 08:57:06,653][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:57:06,654][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:57:14,438][__main__][INFO] - Number of regex retries in iteration 152: 0
[2025-11-13 08:57:14,438][__main__][INFO] - agents played in iteration 152 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:57:14,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:14,927][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:14,960][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:14,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:14,994][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:57:14,994][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:57:15,715][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:57:16,012][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:57:16,340][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:57:16,667][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:57:16,994][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:57:17,321][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:57:17,649][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:57:17,974][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:57:18,299][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:57:18,624][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:57:18,949][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:57:19,275][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:57:19,601][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:57:19,926][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:57:20,252][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:57:20,577][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:57:20,904][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:57:21,233][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:57:21,561][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:57:21,887][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:57:22,213][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:57:22,538][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:57:22,863][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:57:23,188][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:57:23,513][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:57:23,839][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:57:24,165][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:57:24,490][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:57:24,816][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:57:25,142][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:57:25,471][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:57:25,797][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:57:26,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:57:26,839][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:57:27,572][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:57:27,573][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:57:27,575][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:57:28,475][__main__][INFO] - Iteration 153 took 21s (35.67% Gen, 60.20% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 17m 28s. Estimated total time: 18h 11m 7s. Time estimates for 10 more iterations: 3m 38s, 100 more iterations: 36m 22s, 500 more iterations: 3h 1m 51s.
[2025-11-13 08:57:28,477][__main__][INFO] - Starting iteration 153.
[2025-11-13 08:57:28,480][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:57:28,481][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:57:36,428][__main__][INFO] - Number of regex retries in iteration 153: 0
[2025-11-13 08:57:36,429][__main__][INFO] - agents played in iteration 153 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:57:36,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:36,913][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:36,946][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:36,980][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:36,980][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:57:36,981][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:57:37,696][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:57:37,994][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:57:38,319][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:57:38,646][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:57:38,973][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:57:39,297][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:57:39,623][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:57:39,947][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:57:40,272][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:57:40,597][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:57:40,922][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:57:41,247][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:57:41,574][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:57:41,901][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:57:42,226][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:57:42,554][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:57:42,879][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:57:43,206][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:57:43,531][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:57:43,856][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:57:44,183][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:57:44,513][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:57:44,836][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:57:45,164][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:57:45,492][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:57:45,816][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:57:46,142][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:57:46,468][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:57:46,793][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:57:47,119][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:57:47,446][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:57:47,773][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:57:48,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:57:48,797][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:57:49,503][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:57:49,505][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:57:49,507][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:57:50,386][__main__][INFO] - Iteration 154 took 21s (36.28% Gen, 59.70% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 21m 18s. Estimated total time: 18h 15m 19s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 30s, 500 more iterations: 3h 2m 33s.
[2025-11-13 08:57:50,388][__main__][INFO] - Starting iteration 154.
[2025-11-13 08:57:50,391][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:57:50,391][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:57:56,285][mllm.models.large_language_model_local][WARNING] - Response :A did not match regex: (|), retry 1/1
[2025-11-13 08:57:58,370][__main__][INFO] - Number of regex retries in iteration 154: 1
[2025-11-13 08:57:58,370][__main__][INFO] - agents played in iteration 154 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:57:58,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:58,853][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:58,886][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:58,920][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:58,921][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:57:58,922][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:57:59,636][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:57:59,933][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:58:00,261][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:58:00,586][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:58:00,913][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:58:01,239][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:58:01,569][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:58:01,899][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:58:02,231][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:58:02,556][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:58:02,881][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:58:03,208][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:58:03,534][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:58:03,859][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:58:04,188][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:58:04,516][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:58:04,843][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:58:05,168][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:58:05,498][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:58:05,830][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:58:06,159][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:58:06,489][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:58:06,816][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:58:07,144][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:58:07,472][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:58:07,798][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:58:08,125][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:58:08,451][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:58:08,778][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:58:09,105][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:58:09,432][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:58:09,760][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:58:10,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:58:10,794][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:58:11,495][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:58:11,497][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:58:11,500][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:58:12,386][__main__][INFO] - Iteration 155 took 21s (36.27% Gen, 59.69% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 25m 25s. Estimated total time: 18h 19m 48s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 39s, 500 more iterations: 3h 3m 18s.
[2025-11-13 08:58:12,388][__main__][INFO] - Starting iteration 155.
[2025-11-13 08:58:12,390][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:58:12,391][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:58:20,330][__main__][INFO] - Number of regex retries in iteration 155: 0
[2025-11-13 08:58:20,331][__main__][INFO] - agents played in iteration 155 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:58:20,781][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:20,815][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:20,849][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:20,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:20,883][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:58:20,883][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:58:21,619][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:58:21,917][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:58:22,246][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:58:22,574][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:58:22,900][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:58:23,226][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:58:23,550][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:58:23,877][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:58:24,204][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:58:24,529][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:58:24,854][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:58:25,180][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:58:25,507][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:58:25,834][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:58:26,160][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:58:26,487][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:58:26,814][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:58:27,139][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:58:27,464][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:58:27,791][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:58:28,122][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:58:28,449][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:58:28,777][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:58:29,104][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:58:29,432][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:58:29,763][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:58:30,087][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:58:30,416][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:58:30,745][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:58:31,074][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:58:31,404][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:58:31,732][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:58:32,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:58:32,785][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:58:33,497][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:58:33,498][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:58:33,499][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:58:34,376][__main__][INFO] - Iteration 156 took 21s (36.11% Gen, 59.89% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 24m 36s. Estimated total time: 18h 19m 21s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 38s, 500 more iterations: 3h 3m 13s.
[2025-11-13 08:58:34,378][__main__][INFO] - Starting iteration 156.
[2025-11-13 08:58:34,382][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:58:34,383][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:58:42,343][__main__][INFO] - Number of regex retries in iteration 156: 0
[2025-11-13 08:58:42,343][__main__][INFO] - agents played in iteration 156 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:58:42,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:42,832][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:42,866][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:42,899][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:42,900][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:58:42,901][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:58:43,600][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:58:43,898][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:58:44,228][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:58:44,554][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:58:44,881][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:58:45,208][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:58:45,538][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:58:45,867][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:58:46,192][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:58:46,519][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:58:46,846][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:58:47,177][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:58:47,505][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:58:47,833][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:58:48,163][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:58:48,492][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:58:48,824][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:58:49,151][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:58:49,478][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:58:49,806][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:58:50,133][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:58:50,458][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:58:50,783][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:58:51,111][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:58:51,439][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:58:51,767][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:58:52,096][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:58:52,423][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:58:52,749][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:58:53,080][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:58:53,406][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:58:53,732][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:58:54,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:58:54,777][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:58:55,482][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:58:55,483][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:58:55,485][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:58:56,371][__main__][INFO] - Iteration 157 took 21s (36.20% Gen, 59.76% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 24m 21s. Estimated total time: 18h 19m 29s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 38s, 500 more iterations: 3h 3m 14s.
[2025-11-13 08:58:56,375][__main__][INFO] - Starting iteration 157.
[2025-11-13 08:58:56,378][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:58:56,378][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:59:04,004][__main__][INFO] - Number of regex retries in iteration 157: 0
[2025-11-13 08:59:04,004][__main__][INFO] - agents played in iteration 157 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:59:04,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:04,499][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:04,533][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:04,567][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:04,567][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:59:04,568][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:59:05,295][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:59:05,593][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:59:05,920][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:59:06,246][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:59:06,571][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:59:06,900][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:59:07,225][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:59:07,552][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:59:07,879][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:59:08,204][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:59:08,529][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:59:08,858][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:59:09,187][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:59:09,515][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:59:09,842][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:59:10,170][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:59:10,498][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:59:10,824][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:59:11,150][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:59:11,474][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:59:11,801][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:59:12,128][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:59:12,455][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:59:12,782][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:59:13,109][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:59:13,435][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:59:13,760][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:59:14,086][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:59:14,415][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:59:14,742][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:59:15,068][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:59:15,393][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:59:15,720][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:59:16,442][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:59:17,128][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:59:17,130][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:59:17,132][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:59:18,075][__main__][INFO] - Iteration 158 took 21s (35.15% Gen, 60.50% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 9m 24s. Estimated total time: 18h 4m 53s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 9s, 500 more iterations: 3h 0m 48s.
[2025-11-13 08:59:18,077][__main__][INFO] - Starting iteration 158.
[2025-11-13 08:59:18,080][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:59:18,081][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:59:25,967][__main__][INFO] - Number of regex retries in iteration 158: 0
[2025-11-13 08:59:25,967][__main__][INFO] - agents played in iteration 158 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 08:59:26,427][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:26,460][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:26,493][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:26,526][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:26,527][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:59:26,527][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:59:27,260][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:59:27,557][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:59:27,886][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:59:28,211][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:59:28,536][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:59:28,863][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:59:29,189][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:59:29,514][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:59:29,840][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:59:30,167][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:59:30,494][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:59:30,820][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:59:31,146][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:59:31,471][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:59:31,798][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:59:32,123][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:59:32,449][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:59:32,775][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:59:33,101][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:59:33,427][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:59:33,756][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:59:34,084][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:59:34,410][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:59:34,736][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:59:35,062][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:59:35,390][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:59:35,718][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:59:36,043][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:59:36,374][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:59:36,700][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:59:37,028][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:59:37,354][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:59:37,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:59:38,408][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:59:39,122][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:59:39,123][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:59:39,125][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:59:40,087][__main__][INFO] - Iteration 159 took 22s (35.83% Gen, 59.79% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 24m 32s. Estimated total time: 18h 20m 23s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 40s, 500 more iterations: 3h 3m 23s.
[2025-11-13 08:59:40,089][__main__][INFO] - Starting iteration 159.
[2025-11-13 08:59:40,092][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:59:40,093][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:59:47,497][__main__][INFO] - Number of regex retries in iteration 159: 0 [2025-11-13 08:59:47,498][__main__][INFO] - agents played in iteration 159 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 08:59:47,955][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:47,991][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:48,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:48,058][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:48,058][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:59:48,059][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:59:48,771][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:59:49,068][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:59:49,394][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:59:49,720][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:59:50,047][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:59:50,372][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:59:50,697][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:59:51,023][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:59:51,349][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:59:51,675][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:59:52,001][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:59:52,326][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:59:52,656][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:59:52,983][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:59:53,309][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:59:53,635][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:59:53,961][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:59:54,287][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:59:54,613][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:59:54,940][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:59:55,267][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 08:59:55,595][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:59:55,925][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:59:56,254][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:59:56,582][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:59:56,910][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:59:57,236][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:59:57,561][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:59:57,886][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:59:58,212][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:59:58,538][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:59:58,866][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:59:59,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
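The recurring "For task: …, ΔVRAM % (total): …, Current % of VRAM taken: …, Block Peak % of device VRAM: …, ΔTime: …" records look like the output of a per-task memory/time tracker. The context manager below is a hypothetical reconstruction; the probe callables are injected so the sketch runs without a GPU, whereas a real version would likely sample `torch.cuda.memory_allocated()` and `torch.cuda.max_memory_allocated()` against the device total:

```python
import time
from contextlib import contextmanager

@contextmanager
def track_task(name, used_frac, peak_frac, log=print):
    """Hypothetical sketch: log VRAM and wall-time deltas around a task.

    used_frac() / peak_frac() are assumed to return fractions of device
    memory in [0, 1]; they stand in for real CUDA memory queries.
    """
    start_used = used_frac()
    start_time = time.monotonic()
    yield
    delta = used_frac() - start_used          # change in usage over the block
    elapsed = int(time.monotonic() - start_time)
    log(
        f"For task: {name}, ΔVRAM % (total): {delta * 100:.2f}%, "
        f"Current % of VRAM taken: {used_frac() * 100:.2f}%, "
        f"Block Peak % of device VRAM: {peak_frac() * 100:.2f}%, "
        f"ΔTime: {elapsed // 3600:02d}:{elapsed % 3600 // 60:02d}:{elapsed % 60:02d}"
    )
```

Under this reading, the repeated 0.00% ΔVRAM entries for "Get advantages with critic gradient accumulation" simply mean usage was identical before and after each tracked block.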
[2025-11-13 08:59:59,906][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:00:00,620][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:00:00,621][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:00:00,623][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:00:01,586][__main__][INFO] - Iteration 160 took 21s (34.45% Gen, 61.06% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 58m 31s. Estimated total time: 17h 54m 43s. Time estimates for 10 more iterations: 3m 34s, 100 more iterations: 35m 49s, 500 more iterations: 2h 59m 7s.
[2025-11-13 09:00:01,588][__main__][INFO] - Starting iteration 160.
[2025-11-13 09:00:01,592][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 09:00:01,592][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:00:09,297][__main__][INFO] - Number of regex retries in iteration 160: 0
[2025-11-13 09:00:09,298][__main__][INFO] - agents played in iteration 160 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:00:09,758][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:09,794][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:09,828][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:09,862][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:09,863][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:00:09,863][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:00:10,589][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:00:10,885][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:00:11,215][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:00:11,542][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:00:11,874][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:00:12,200][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:00:12,527][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:00:12,852][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:00:13,179][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:00:13,503][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:00:13,829][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:00:14,156][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:00:14,482][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:00:14,809][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:00:15,137][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:00:15,464][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:00:15,791][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:00:16,118][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:00:16,443][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:00:16,772][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:00:17,102][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:00:17,433][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:00:17,759][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:00:18,087][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:00:18,414][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:00:18,741][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:00:19,073][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:00:19,399][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:00:19,724][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:00:20,049][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:00:20,377][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:00:20,702][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:00:21,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:00:21,743][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:00:22,458][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:00:22,459][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:00:22,461][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:00:24,367][__main__][INFO] - Iteration 161 took 22s (33.83% Gen, 57.79% Train). Generation: 7s, Training: 13s. Estimated remaining time: 18h 2m 12s. Estimated total time: 18h 58m 47s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 57s, 500 more iterations: 3h 9m 47s.
[2025-11-13 09:00:24,369][__main__][INFO] - Starting iteration 161.
[2025-11-13 09:00:24,372][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:00:24,372][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:00:32,460][__main__][INFO] - Number of regex retries in iteration 161: 0
[2025-11-13 09:00:32,461][__main__][INFO] - agents played in iteration 161 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:00:32,917][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:32,951][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:32,985][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:33,018][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:33,019][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:00:33,020][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:00:33,735][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:00:34,032][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:00:34,358][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:00:34,682][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:00:35,011][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:00:35,336][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:00:35,663][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:00:35,987][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:00:36,313][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:00:36,639][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:00:36,966][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:00:37,291][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:00:37,617][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:00:37,945][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:00:38,271][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:00:38,597][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:00:38,926][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:00:39,254][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:00:39,583][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:00:39,909][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:00:40,235][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:00:40,563][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:00:40,890][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:00:41,217][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:00:41,544][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:00:41,870][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:00:42,197][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:00:42,522][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:00:42,848][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:00:43,174][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:00:43,499][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:00:43,827][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:00:44,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:00:44,866][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:00:45,585][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:00:45,587][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:00:45,588][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:00:46,576][__main__][INFO] - Iteration 162 took 22s (36.43% Gen, 59.12% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 33m 18s. Estimated total time: 18h 30m 15s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 0s, 500 more iterations: 3h 5m 2s.
[2025-11-13 09:00:46,578][__main__][INFO] - Starting iteration 162.
[2025-11-13 09:00:46,582][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:00:46,582][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:00:54,630][__main__][INFO] - Number of regex retries in iteration 162: 0
[2025-11-13 09:00:54,631][__main__][INFO] - agents played in iteration 162 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:00:55,070][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:55,106][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:55,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:55,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:55,174][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:00:55,174][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:00:55,874][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:00:56,170][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:00:56,496][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:00:56,822][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:00:57,149][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:00:57,475][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:00:57,801][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:00:58,127][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:00:58,455][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:00:58,781][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:00:59,108][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:00:59,438][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:00:59,765][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:01:00,094][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:01:00,421][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:01:00,748][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:01:01,075][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:01:01,401][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:01:01,727][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:01:02,054][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:01:02,379][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:01:02,705][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:01:03,033][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:01:03,360][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:01:03,689][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:01:04,014][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:01:04,340][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:01:04,666][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:01:04,995][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:01:05,322][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:01:05,652][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:01:05,984][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:01:06,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:01:07,035][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:01:07,767][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:01:07,768][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:01:07,770][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:01:08,736][__main__][INFO] - Iteration 163 took 22s (36.33% Gen, 59.30% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 30m 26s. Estimated total time: 18h 27m 46s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 55s, 500 more iterations: 3h 4m 37s.
[2025-11-13 09:01:08,738][__main__][INFO] - Starting iteration 163.
[2025-11-13 09:01:08,741][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:01:08,742][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:01:16,647][__main__][INFO] - Number of regex retries in iteration 163: 0
[2025-11-13 09:01:16,648][__main__][INFO] - agents played in iteration 163 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:01:17,105][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:17,139][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:17,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:17,204][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:17,205][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:01:17,205][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:01:17,923][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:01:18,218][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:01:18,549][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:01:18,875][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:01:19,204][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:01:19,533][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:01:19,859][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:01:20,189][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:01:20,520][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:01:20,847][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:01:21,176][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:01:21,503][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:01:21,833][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:01:22,161][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:01:22,486][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:01:22,813][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:01:23,140][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:01:23,467][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:01:23,792][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:01:24,117][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:01:24,445][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:01:24,773][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:01:25,099][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:01:25,426][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:01:25,753][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:01:26,085][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:01:26,415][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:01:26,745][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:01:27,071][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:01:27,396][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:01:27,722][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:01:28,047][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:01:28,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:01:29,087][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:01:29,799][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:01:29,800][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:01:29,802][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:01:30,779][__main__][INFO] - Iteration 164 took 22s (35.87% Gen, 59.69% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 24m 15s. Estimated total time: 18h 21m 57s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 43s, 500 more iterations: 3h 3m 39s.
[2025-11-13 09:01:30,782][__main__][INFO] - Starting iteration 164.
[2025-11-13 09:01:30,784][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:01:30,785][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:01:38,425][__main__][INFO] - Number of regex retries in iteration 164: 0
[2025-11-13 09:01:38,426][__main__][INFO] - agents played in iteration 164 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:01:38,883][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:38,917][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:38,951][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:38,985][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:38,985][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:01:38,986][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:01:39,709][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:01:40,005][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:01:40,331][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:01:40,658][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:01:40,984][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:01:41,313][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:01:41,640][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:01:41,966][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:01:42,297][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:01:42,628][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:01:42,955][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:01:43,280][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:01:43,605][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:01:43,932][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:01:44,257][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:01:44,584][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:01:44,910][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:01:45,236][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:01:45,562][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:01:45,888][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:01:46,213][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:01:46,540][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:01:46,871][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:01:47,199][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:01:47,523][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:01:47,848][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:01:48,174][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:01:48,502][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:01:48,827][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:01:49,157][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:01:49,486][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:01:49,816][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:01:50,148][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:01:50,874][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:01:51,600][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:01:51,602][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:01:51,603][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:01:52,560][__main__][INFO] - Iteration 165 took 21s (35.08% Gen, 60.51% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 10m 45s. Estimated total time: 18h 8m 48s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 17s, 500 more iterations: 3h 1m 28s.
[2025-11-13 09:01:52,562][__main__][INFO] - Starting iteration 165.
[2025-11-13 09:01:52,565][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:01:52,566][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:02:00,529][__main__][INFO] - Number of regex retries in iteration 165: 0
[2025-11-13 09:02:00,529][__main__][INFO] - agents played in iteration 165 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:02:00,984][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:01,019][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:01,052][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:01,086][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:01,086][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:02:01,087][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:02:01,799][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:02:02,096][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:02:02,426][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:02:02,753][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:02:03,080][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:02:03,407][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:02:03,735][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:02:04,063][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:02:04,392][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:02:04,718][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:02:05,045][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:02:05,373][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:02:05,699][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:02:06,025][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:02:06,351][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:02:06,679][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:02:07,004][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:02:07,331][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:02:07,656][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:02:07,981][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:02:08,307][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:02:08,634][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:02:08,962][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:02:09,292][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:02:09,619][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:02:09,950][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:02:10,283][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:02:10,615][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:02:10,942][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:02:11,268][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:02:11,595][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:02:11,928][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:02:12,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:02:12,962][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:02:13,687][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:02:13,689][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:02:13,690][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:02:14,632][__main__][INFO] - Iteration 166 took 22s (36.08% Gen, 59.64% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 24m 57s. Estimated total time: 18h 23m 22s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 46s, 500 more iterations: 3h 3m 53s.
[2025-11-13 09:02:14,634][__main__][INFO] - Starting iteration 166.
[2025-11-13 09:02:14,636][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:02:14,637][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:02:22,187][__main__][INFO] - Number of regex retries in iteration 166: 0
[2025-11-13 09:02:22,188][__main__][INFO] - agents played in iteration 166 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:02:22,643][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:22,680][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:22,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:22,747][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:22,748][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:02:22,749][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:02:23,471][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:02:23,768][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:02:24,096][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:02:24,425][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:02:24,753][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:02:25,079][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:02:25,405][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:02:25,731][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:02:26,058][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:02:26,384][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:02:26,711][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:02:27,037][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:02:27,363][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:02:27,687][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:02:28,014][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:02:28,339][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:02:28,669][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:02:28,995][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:02:29,326][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:02:29,651][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:02:29,976][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:02:30,309][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:02:30,637][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:02:30,967][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:02:31,295][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:02:31,627][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:02:31,952][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:02:32,281][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:02:32,606][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:02:32,936][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:02:33,268][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:02:33,596][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:02:33,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:02:34,655][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:02:35,394][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:02:35,395][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:02:35,397][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:02:36,319][__main__][INFO] - Iteration 167 took 21s (34.82% Gen, 60.92% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 5m 22s. Estimated total time: 18h 4m 10s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 8s, 500 more iterations: 3h 0m 41s.
[2025-11-13 09:02:36,321][__main__][INFO] - Starting iteration 167.
[2025-11-13 09:02:36,324][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:02:36,324][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:02:44,274][__main__][INFO] - Number of regex retries in iteration 167: 0
[2025-11-13 09:02:44,275][__main__][INFO] - agents played in iteration 167 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:02:44,734][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:44,770][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:44,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:44,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:44,840][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:02:44,840][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:02:45,555][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:02:45,854][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:02:46,182][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:02:46,514][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:02:46,842][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:02:47,169][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:02:47,496][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:02:47,824][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:02:48,149][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:02:48,476][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:02:48,801][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:02:49,127][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:02:49,453][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:02:49,778][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:02:50,105][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:02:50,431][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:02:50,759][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:02:51,089][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:02:51,417][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:02:51,743][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:02:52,070][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:02:52,398][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:02:52,723][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:02:53,052][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:02:53,378][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:02:53,705][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:02:54,033][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:02:54,361][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:02:54,687][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:02:55,013][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:02:55,340][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:02:55,668][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:02:55,999][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:02:56,734][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:02:57,451][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:02:57,452][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:02:57,454][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:02:58,444][__main__][INFO] - Iteration 168 took 22s (35.94% Gen, 59.58% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 26m 53s. Estimated total time: 18h 26m 2s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 52s, 500 more iterations: 3h 4m 20s.
[2025-11-13 09:02:58,446][__main__][INFO] - Starting iteration 168.
[2025-11-13 09:02:58,448][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:02:58,449][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:03:06,439][__main__][INFO] - Number of regex retries in iteration 168: 0
[2025-11-13 09:03:06,440][__main__][INFO] - agents played in iteration 168 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:03:06,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:06,942][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:06,975][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:07,009][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:07,009][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:03:07,010][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:03:07,732][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:03:08,028][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:03:08,356][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:03:08,682][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:03:09,007][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:03:09,334][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:03:09,661][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:03:09,987][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:03:10,313][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:03:10,639][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:03:10,965][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:03:11,293][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:03:11,618][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:03:11,943][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:03:12,269][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:03:12,592][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:03:12,917][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:03:13,243][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:03:13,569][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:03:13,897][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:03:14,225][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:03:14,551][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:03:14,876][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:03:15,203][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:03:15,530][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:03:15,854][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:03:16,179][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:03:16,507][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:03:16,832][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:03:17,157][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:03:17,483][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:03:17,808][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:03:18,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:03:18,847][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:03:19,563][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:03:19,564][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:03:19,566][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:03:20,511][__main__][INFO] - Iteration 169 took 22s (36.22% Gen, 59.49% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 23m 39s. Estimated total time: 18h 23m 10s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 46s, 500 more iterations: 3h 3m 51s.
[2025-11-13 09:03:20,513][__main__][INFO] - Starting iteration 169.
[2025-11-13 09:03:20,517][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:03:20,518][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:03:28,475][__main__][INFO] - Number of regex retries in iteration 169: 0
[2025-11-13 09:03:28,475][__main__][INFO] - agents played in iteration 169 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:03:28,928][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:28,961][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:28,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:29,029][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:29,030][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:03:29,030][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:03:29,752][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:03:30,050][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:03:30,380][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:03:30,711][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:03:31,037][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:03:31,364][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:03:31,692][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:03:32,019][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:03:32,345][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:03:32,671][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:03:32,998][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:03:33,324][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:03:33,650][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:03:33,978][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:03:34,304][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:03:34,634][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:03:34,962][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:03:35,292][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:03:35,621][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:03:35,947][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:03:36,275][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:03:36,605][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:03:36,932][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:03:37,258][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:03:37,590][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:03:37,922][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:03:38,251][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:03:38,579][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:03:38,907][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:03:39,235][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:03:39,563][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:03:39,891][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:03:40,217][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:03:40,931][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:03:41,635][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:03:41,637][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:03:41,641][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:03:42,624][__main__][INFO] - Iteration 170 took 22s (36.00% Gen, 59.55% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 25m 30s. Estimated total time: 18h 25m 24s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 50s, 500 more iterations: 3h 4m 14s.
[2025-11-13 09:03:42,626][__main__][INFO] - Starting iteration 170.
[2025-11-13 09:03:42,628][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:03:42,629][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:03:48,026][mllm.models.large_language_model_local][WARNING] - Response did not match regex: (|), retry 1/1
[2025-11-13 09:03:50,586][__main__][INFO] - Number of regex retries in iteration 170: 1
[2025-11-13 09:03:50,586][__main__][INFO] - agents played in iteration 170 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:03:51,061][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:51,094][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:51,129][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:51,511][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:51,512][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:03:51,512][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:03:52,229][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:03:52,527][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:03:52,854][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:03:53,180][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:03:53,508][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:03:53,839][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:03:54,169][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:03:54,496][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:03:54,826][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:03:55,153][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:03:55,482][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:03:55,811][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:03:56,139][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:03:56,465][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:03:56,793][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:03:57,124][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:03:57,453][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:03:57,782][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:03:58,110][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:03:58,438][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:03:58,764][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:03:59,091][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:03:59,417][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:03:59,744][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:04:00,071][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:04:00,397][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:04:00,722][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:04:01,048][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:04:01,374][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:04:01,702][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:04:02,031][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:04:02,359][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:04:02,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:04:03,403][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:04:04,111][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:04:04,113][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:04:04,114][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:04:05,965][__main__][INFO] - Iteration 171 took 23s (34.10% Gen, 57.97% Train). Generation: 7s, Training: 13s. Estimated remaining time: 18h 26m 35s. Estimated total time: 19h 26m 52s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 53s, 500 more iterations: 3h 14m 28s.
[2025-11-13 09:04:05,967][__main__][INFO] - Starting iteration 171.
[2025-11-13 09:04:05,970][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:04:05,971][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:04:14,101][__main__][INFO] - Number of regex retries in iteration 171: 0
[2025-11-13 09:04:14,101][__main__][INFO] - agents played in iteration 171 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:04:14,559][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:14,592][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:14,626][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:14,660][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:14,661][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:04:14,661][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:04:15,372][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:04:15,667][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:04:15,995][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:04:16,322][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:04:16,647][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:04:16,973][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:04:17,299][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:04:17,625][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:04:17,953][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:04:18,279][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:04:18,605][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:04:18,932][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:04:19,258][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:04:19,585][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:04:19,911][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:04:20,237][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:04:20,565][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:04:20,892][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:04:21,219][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:04:21,547][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:04:21,874][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:04:22,199][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:04:22,527][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:04:22,854][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:04:23,179][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:04:23,505][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:04:23,832][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:04:24,160][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:04:24,486][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:04:24,811][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:04:25,138][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:04:25,466][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:04:25,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:04:26,526][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:04:27,247][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:04:27,248][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:04:27,250][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:04:28,145][__main__][INFO] - Iteration 172 took 22s (36.67% Gen, 59.29% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 28m 7s. Estimated total time: 18h 28m 46s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 57s, 500 more iterations: 3h 4m 47s.
[2025-11-13 09:04:28,147][__main__][INFO] - Starting iteration 172.
[2025-11-13 09:04:28,149][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:04:28,150][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:04:35,691][__main__][INFO] - Number of regex retries in iteration 172: 0
[2025-11-13 09:04:35,692][__main__][INFO] - agents played in iteration 172 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:04:36,150][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:36,184][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:36,218][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:36,252][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:36,253][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:04:36,253][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:04:36,978][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:04:37,274][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:04:37,602][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:04:37,928][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:04:38,253][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:04:38,578][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:04:38,905][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:04:39,230][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:04:39,557][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:04:39,885][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:04:40,212][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:04:40,540][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:04:40,868][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:04:41,193][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:04:41,518][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:04:41,849][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:04:42,176][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:04:42,503][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:04:42,827][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:04:43,152][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:04:43,478][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:04:43,807][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:04:44,131][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:04:44,457][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:04:44,784][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:04:45,111][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:04:45,439][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:04:45,763][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:04:46,088][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:04:46,414][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:04:46,738][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:04:47,065][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:04:47,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:04:48,099][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:04:48,811][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:04:48,812][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:04:48,814][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:04:49,794][__main__][INFO] - Iteration 173 took 21s (34.84% Gen, 60.62% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 1m 16s. Estimated total time: 18h 2m 17s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 4s, 500 more iterations: 3h 0m 22s.
[2025-11-13 09:04:49,797][__main__][INFO] - Starting iteration 173.
[2025-11-13 09:04:49,799][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:04:49,800][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:04:57,580][__main__][INFO] - Number of regex retries in iteration 173: 0
[2025-11-13 09:04:57,581][__main__][INFO] - agents played in iteration 173 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:04:58,048][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:58,082][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:58,115][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:58,148][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:58,149][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:04:58,149][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:04:58,870][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:04:59,167][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:04:59,494][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:04:59,820][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:05:00,145][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:05:00,476][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:05:00,802][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:05:01,129][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:05:01,456][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:05:01,787][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:05:02,116][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:05:02,444][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:05:02,771][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:05:03,100][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:05:03,426][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:05:03,753][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:05:04,080][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:05:04,406][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:05:04,733][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:05:05,061][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:05:05,388][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:05:05,714][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:05:06,043][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:05:06,368][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:05:06,696][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:05:07,024][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:05:07,348][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:05:07,673][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:05:07,998][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:05:08,324][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:05:08,649][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:05:08,974][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:05:09,300][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:05:10,020][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:05:10,741][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:05:10,743][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:05:10,744][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:05:11,732][__main__][INFO] - Iteration 174 took 21s (35.48% Gen, 60.01% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 15m 18s. Estimated total time: 18h 16m 40s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 33s, 500 more iterations: 3h 2m 46s.
[2025-11-13 09:05:11,734][__main__][INFO] - Starting iteration 174.
[2025-11-13 09:05:11,737][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:05:11,738][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:05:19,742][__main__][INFO] - Number of regex retries in iteration 174: 0
[2025-11-13 09:05:19,742][__main__][INFO] - agents played in iteration 174 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:05:20,211][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:20,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:20,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:20,311][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:20,311][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:05:20,311][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:05:21,038][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:05:21,334][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:05:21,664][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:05:21,990][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:05:22,319][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:05:22,645][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:05:22,973][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:05:23,303][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:05:23,634][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:05:23,965][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:05:24,294][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:05:24,620][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:05:24,947][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:05:25,273][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:05:25,599][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:05:25,926][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:05:26,252][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:05:26,576][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:05:26,903][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:05:27,230][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:05:27,560][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:05:27,887][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:05:28,214][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:05:28,541][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:05:28,871][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:05:29,198][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:05:29,523][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:05:29,850][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:05:30,174][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:05:30,499][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:05:30,826][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:05:31,152][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:05:31,478][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:05:32,191][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:05:32,923][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:05:32,924][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:05:32,926][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:05:33,893][__main__][INFO] - Iteration 175 took 22s (36.13% Gen, 59.50% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 26m 5s. Estimated total time: 18h 27m 50s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 55s, 500 more iterations: 3h 4m 38s.
[2025-11-13 09:05:33,895][__main__][INFO] - Starting iteration 175.
[2025-11-13 09:05:33,899][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:05:33,899][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:05:42,315][__main__][INFO] - Number of regex retries in iteration 175: 0
[2025-11-13 09:05:42,316][__main__][INFO] - agents played in iteration 175 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:05:42,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:42,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:42,837][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:42,871][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:42,872][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:05:42,872][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:05:43,581][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:05:43,877][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:05:44,204][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:05:44,531][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:05:44,860][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:05:45,186][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:05:45,519][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:05:45,847][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:05:46,177][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:05:46,502][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:05:46,828][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:05:47,156][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:05:47,481][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:05:47,808][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:05:48,134][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:05:48,461][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:05:48,788][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:05:49,114][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:05:49,441][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:05:49,767][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:05:50,095][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:05:50,423][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:05:50,749][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:05:51,077][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:05:51,405][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:05:51,732][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:05:52,059][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:05:52,385][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:05:52,710][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:05:53,038][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:05:53,363][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:05:53,693][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:05:54,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:05:54,737][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:05:55,463][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:05:55,464][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:05:55,466][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:05:56,397][__main__][INFO] - Iteration 176 took 22s (37.41% Gen, 58.45% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 42m 51s. Estimated total time: 18h 44m 59s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 29s, 500 more iterations: 3h 7m 29s.
[2025-11-13 09:05:56,399][__main__][INFO] - Starting iteration 176.
[2025-11-13 09:05:56,402][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:05:56,403][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:06:04,764][__main__][INFO] - Number of regex retries in iteration 176: 0
[2025-11-13 09:06:04,765][__main__][INFO] - agents played in iteration 176 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:06:05,216][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:05,249][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:05,282][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:05,316][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:05,316][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:06:05,317][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:06:06,044][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:06:06,341][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:06:06,669][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:06:06,998][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:06:07,327][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:06:07,657][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:06:07,987][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:06:08,315][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:06:08,646][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:06:08,976][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:06:09,300][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:06:09,626][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:06:09,950][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:06:10,275][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:06:10,602][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:06:10,930][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:06:11,254][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:06:11,581][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:06:11,907][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:06:12,234][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:06:12,561][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:06:12,891][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:06:13,218][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:06:13,543][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:06:13,868][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:06:14,194][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:06:14,521][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:06:14,847][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:06:15,173][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:06:15,499][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:06:15,826][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:06:16,155][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:06:16,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:06:17,196][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:06:17,914][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:06:17,915][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:06:17,917][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:06:18,868][__main__][INFO] - Iteration 177 took 22s (37.22% Gen, 58.54% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 40m 51s. Estimated total time: 18h 43m 20s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 26s, 500 more iterations: 3h 7m 13s.
[2025-11-13 09:06:18,870][__main__][INFO] - Starting iteration 177.
[2025-11-13 09:06:18,873][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:06:18,874][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:06:27,321][__main__][INFO] - Number of regex retries in iteration 177: 0
[2025-11-13 09:06:27,321][__main__][INFO] - agents played in iteration 177 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:06:27,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:27,811][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:27,844][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:27,878][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:27,878][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:06:27,879][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:06:28,577][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:06:29,003][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:06:29,320][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:06:29,647][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:06:29,973][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:06:30,298][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:06:30,621][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:06:30,947][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:06:31,271][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:06:31,596][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:06:31,923][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:06:32,247][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:06:32,571][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:06:32,896][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:06:33,222][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:06:33,548][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:06:33,874][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:06:34,199][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:06:34,527][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:06:34,853][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:06:35,178][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:06:35,507][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:06:35,834][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:06:36,158][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:06:36,487][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:06:36,811][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:06:37,137][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:06:37,463][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:06:37,790][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:06:38,117][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:06:38,444][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:06:38,770][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:06:39,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:06:39,793][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:06:40,507][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:06:40,509][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:06:40,511][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:06:41,426][__main__][INFO] - Iteration 178 took 22s (37.46% Gen, 58.48% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 44m 47s. Estimated total time: 18h 47m 40s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 35s, 500 more iterations: 3h 7m 56s.
[2025-11-13 09:06:41,428][__main__][INFO] - Starting iteration 178.
[2025-11-13 09:06:41,431][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:06:41,431][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:06:49,786][__main__][INFO] - Number of regex retries in iteration 178: 0
[2025-11-13 09:06:49,787][__main__][INFO] - agents played in iteration 178 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:06:50,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:50,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:50,322][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:50,355][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:50,356][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:06:50,356][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:06:51,088][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:06:51,385][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:06:51,713][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:06:52,039][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:06:52,367][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:06:52,694][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:06:53,021][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:06:53,346][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:06:53,674][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:06:53,998][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:06:54,325][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:06:54,651][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:06:54,975][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:06:55,300][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:06:55,626][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:06:55,951][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:06:56,278][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:06:56,607][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:06:56,933][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:06:57,259][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:06:57,584][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:06:57,911][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:06:58,237][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:06:58,566][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:06:58,892][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:06:59,217][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:06:59,542][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:06:59,866][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:07:00,192][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:07:00,517][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:07:00,843][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:07:01,169][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:07:01,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:07:02,219][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:07:02,942][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:07:02,943][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:07:02,945][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:07:03,905][__main__][INFO] - Iteration 179 took 22s (37.18% Gen, 58.54% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 40m 31s. Estimated total time: 18h 43m 45s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 27s, 500 more iterations: 3h 7m 17s.
[2025-11-13 09:07:03,907][__main__][INFO] - Starting iteration 179.
[2025-11-13 09:07:03,910][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:07:03,911][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:07:12,616][__main__][INFO] - Number of regex retries in iteration 179: 0
[2025-11-13 09:07:12,617][__main__][INFO] - agents played in iteration 179 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:07:13,082][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:13,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:13,152][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:13,186][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:13,187][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:07:13,187][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:07:13,917][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:07:14,214][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:07:14,541][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:07:14,868][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:07:15,197][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:07:15,522][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:07:15,850][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:07:16,175][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:07:16,501][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:07:16,826][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:07:17,151][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:07:17,475][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:07:17,803][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:07:18,128][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:07:18,455][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:07:18,780][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:07:19,110][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:07:19,438][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:07:19,764][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:07:20,094][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:07:20,422][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:07:20,749][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:07:21,079][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:07:21,404][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:07:21,729][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:07:22,054][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:07:22,379][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:07:22,705][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:07:23,032][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:07:23,358][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:07:23,683][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:07:24,008][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:07:24,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:07:25,058][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:07:25,792][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:07:25,794][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:07:25,796][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:07:26,794][__main__][INFO] - Iteration 180 took 22s (38.04% Gen, 57.59% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 0m 37s. Estimated total time: 19h 4m 14s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 8s, 500 more iterations: 3h 10m 42s.
[2025-11-13 09:07:26,796][__main__][INFO] - Starting iteration 180.
[2025-11-13 09:07:26,799][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:07:26,800][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:07:35,257][__main__][INFO] - Number of regex retries in iteration 180: 0
[2025-11-13 09:07:35,258][__main__][INFO] - agents played in iteration 180 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:07:35,707][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:35,743][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:35,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:35,809][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:35,809][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:07:35,810][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:07:36,536][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:07:36,833][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:07:37,160][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:07:37,492][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:07:37,824][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:07:38,150][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:07:38,478][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:07:38,804][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:07:39,131][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:07:39,459][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:07:39,784][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:07:40,112][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:07:40,436][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:07:40,762][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:07:41,090][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:07:41,417][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:07:41,744][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:07:42,071][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:07:42,396][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:07:42,722][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:07:43,050][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:07:43,375][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:07:43,700][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:07:44,026][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:07:44,351][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:07:44,676][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:07:45,005][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:07:45,337][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:07:45,667][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:07:45,997][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:07:46,324][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:07:46,653][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:07:46,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:07:47,688][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:07:48,413][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:07:48,414][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:07:48,416][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:07:50,279][__main__][INFO] - Iteration 181 took 23s (36.02% Gen, 56.04% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 30m 1s. Estimated total time: 19h 34m 2s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 8s, 500 more iterations: 3h 15m 40s.
[2025-11-13 09:07:50,281][__main__][INFO] - Starting iteration 181.
[2025-11-13 09:07:50,284][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:07:50,284][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:07:59,397][__main__][INFO] - Number of regex retries in iteration 181: 0 [2025-11-13 09:07:59,398][__main__][INFO] - agents played in iteration 181 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:07:59,852][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:07:59,888][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:07:59,922][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:07:59,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:07:59,956][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:07:59,957][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:08:00,660][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:08:00,959][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:08:01,287][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:08:01,613][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:08:01,940][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:08:02,267][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:08:02,593][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:08:02,920][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:08:03,251][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:08:03,581][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:08:03,910][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:08:04,236][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:08:04,565][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:08:04,895][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:08:05,225][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:08:05,555][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:08:05,885][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:08:06,213][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:08:06,539][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:08:06,866][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:08:07,193][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:08:07,522][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:08:07,848][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:08:08,175][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:08:08,500][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:08:08,825][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:08:09,150][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:08:09,476][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:08:09,803][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:08:10,127][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:08:10,455][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:08:10,781][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:08:11,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:08:11,827][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:08:12,542][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:08:12,544][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:08:12,545][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:08:13,484][__main__][INFO] - Iteration 182 took 23s (39.28% Gen, 56.67% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 15m 37s. Estimated total time: 19h 20m 1s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 40s, 500 more iterations: 3h 13m 20s.
[2025-11-13 09:08:13,486][__main__][INFO] - Starting iteration 182.
[2025-11-13 09:08:13,488][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:08:13,489][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:08:22,427][__main__][INFO] - Number of regex retries in iteration 182: 0
[2025-11-13 09:08:22,428][__main__][INFO] - agents played in iteration 182 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:08:22,895][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:22,929][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:22,962][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:22,996][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:22,997][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:08:22,997][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:08:23,699][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:08:23,994][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:08:24,321][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:08:24,647][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:08:24,974][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:08:25,303][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:08:25,629][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:08:25,954][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:08:26,280][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:08:26,605][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:08:26,936][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:08:27,266][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:08:27,592][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:08:27,920][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:08:28,245][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:08:28,572][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:08:28,897][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:08:29,225][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:08:29,549][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:08:29,874][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:08:30,200][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:08:30,527][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:08:30,853][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:08:31,180][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:08:31,508][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:08:31,834][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:08:32,160][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:08:32,487][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:08:32,813][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:08:33,138][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:08:33,464][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:08:33,791][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:08:34,117][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:08:34,837][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:08:35,575][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:08:35,576][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:08:35,578][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:08:36,456][__main__][INFO] - Iteration 183 took 22s (38.91% Gen, 57.25% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 3m 39s. Estimated total time: 19h 8m 27s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 16s, 500 more iterations: 3h 11m 24s.
[2025-11-13 09:08:36,458][__main__][INFO] - Starting iteration 183.
[2025-11-13 09:08:36,461][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:08:36,461][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:08:45,129][__main__][INFO] - Number of regex retries in iteration 183: 0
[2025-11-13 09:08:45,130][__main__][INFO] - agents played in iteration 183 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:08:45,587][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:45,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:45,654][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:45,688][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:45,689][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:08:45,689][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:08:46,413][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:08:46,708][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:08:47,034][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:08:47,359][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:08:47,684][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:08:48,011][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:08:48,337][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:08:48,664][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:08:48,989][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:08:49,315][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:08:49,642][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:08:49,968][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:08:50,294][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:08:50,621][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:08:50,947][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:08:51,274][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:08:51,600][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:08:51,926][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:08:52,252][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:08:52,579][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:08:52,908][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:08:53,232][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:08:53,558][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:08:53,885][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:08:54,210][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:08:54,536][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:08:54,863][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:08:55,189][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:08:55,515][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:08:55,841][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:08:56,168][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:08:56,495][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:08:56,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:08:57,541][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:08:58,272][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:08:58,274][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:08:58,279][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:08:59,225][__main__][INFO] - Iteration 184 took 22s (38.08% Gen, 57.76% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 53m 4s. Estimated total time: 18h 58m 14s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 56s, 500 more iterations: 3h 9m 42s.
[2025-11-13 09:08:59,226][__main__][INFO] - Starting iteration 184.
[2025-11-13 09:08:59,229][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:08:59,229][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:09:08,122][__main__][INFO] - Number of regex retries in iteration 184: 0
[2025-11-13 09:09:08,122][__main__][INFO] - agents played in iteration 184 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:09:08,581][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:08,618][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:08,652][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:08,686][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:08,686][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:09:08,687][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:09:09,409][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:09:09,705][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:09:10,032][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:09:10,360][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:09:10,687][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:09:11,012][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:09:11,339][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:09:11,670][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:09:11,996][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:09:12,324][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:09:12,649][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:09:12,975][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:09:13,300][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:09:13,628][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:09:13,953][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:09:14,280][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:09:14,607][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:09:14,936][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:09:15,261][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:09:15,588][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:09:15,914][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:09:16,240][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:09:16,565][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:09:16,892][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:09:17,217][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:09:17,545][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:09:17,877][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:09:18,204][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:09:18,534][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:09:18,861][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:09:19,188][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:09:19,517][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:09:19,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:09:20,571][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:09:21,315][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:09:21,316][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:09:21,318][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:09:22,259][__main__][INFO] - Iteration 185 took 23s (38.61% Gen, 57.29% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 6m 0s. Estimated total time: 19h 11m 33s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 23s, 500 more iterations: 3h 11m 55s.
[2025-11-13 09:09:22,261][__main__][INFO] - Starting iteration 185.
[2025-11-13 09:09:22,264][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:09:22,265][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:09:30,796][__main__][INFO] - Number of regex retries in iteration 185: 0
[2025-11-13 09:09:30,797][__main__][INFO] - agents played in iteration 185 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:09:31,254][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:31,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:31,321][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:31,355][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:31,356][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:09:31,356][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:09:32,066][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:09:32,362][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:09:32,689][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:09:33,017][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:09:33,344][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:09:33,671][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:09:33,996][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:09:34,323][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:09:34,648][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:09:34,974][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:09:35,300][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:09:35,627][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:09:35,952][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:09:36,277][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:09:36,602][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:09:36,929][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:09:37,256][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:09:37,582][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:09:37,909][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:09:38,236][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:09:38,566][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:09:38,895][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:09:39,220][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:09:39,547][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:09:39,873][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:09:40,199][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:09:40,525][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:09:40,851][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:09:41,175][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:09:41,502][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:09:41,827][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:09:42,153][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:09:42,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:09:43,209][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:09:43,941][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:09:43,943][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:09:43,945][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:09:44,877][__main__][INFO] - Iteration 186 took 22s (37.73% Gen, 58.14% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 44m 47s. Estimated total time: 18h 50m 43s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 41s, 500 more iterations: 3h 8m 27s.
[2025-11-13 09:09:44,879][__main__][INFO] - Starting iteration 186.
[2025-11-13 09:09:44,882][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:09:44,882][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:09:53,766][__main__][INFO] - Number of regex retries in iteration 186: 0
[2025-11-13 09:09:53,767][__main__][INFO] - agents played in iteration 186 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:09:54,222][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:54,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:54,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:54,321][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:54,322][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:09:54,322][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:09:55,034][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:09:55,330][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:09:55,661][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:09:55,989][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:09:56,320][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:09:56,644][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:09:56,971][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:09:57,300][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:09:57,627][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:09:57,957][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:09:58,287][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:09:58,614][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:09:58,944][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:09:59,272][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:09:59,597][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:09:59,923][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:10:00,248][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:10:00,574][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:10:00,899][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:10:01,225][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:10:01,551][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:10:01,878][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:10:02,204][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:10:02,536][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:10:02,864][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:10:03,191][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:10:03,519][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:10:03,849][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:10:04,174][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:10:04,506][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:10:04,833][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:10:05,164][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:10:05,492][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:10:06,214][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:10:06,943][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:10:06,944][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:10:06,946][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:10:07,877][__main__][INFO] - Iteration 187 took 22s (38.63% Gen, 57.31% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 3m 28s. Estimated total time: 19h 9m 46s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 19s, 500 more iterations: 3h 11m 37s.
[2025-11-13 09:10:07,879][__main__][INFO] - Starting iteration 187.
[2025-11-13 09:10:07,881][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:10:07,881][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:10:16,437][__main__][INFO] - Number of regex retries in iteration 187: 0
[2025-11-13 09:10:16,438][__main__][INFO] - agents played in iteration 187 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:10:16,908][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:10:16,942][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:10:16,975][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:10:17,009][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:10:17,010][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:10:17,011][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:10:17,748][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:10:18,046][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:10:18,374][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:10:18,701][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:10:19,027][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:10:19,352][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:10:19,678][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:10:20,003][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:10:20,329][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:10:20,654][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:10:20,979][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:10:21,304][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:10:21,630][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:10:21,957][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:10:22,284][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:10:22,609][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:10:22,934][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:10:23,258][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:10:23,584][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:10:23,910][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:10:24,235][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:10:24,560][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:10:24,887][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:10:25,214][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:10:25,540][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:10:25,865][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:10:26,192][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:10:26,519][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:10:26,846][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:10:27,172][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:10:27,500][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:10:27,825][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:10:28,151][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:10:28,875][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:10:29,628][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:10:29,629][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:10:29,631][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:10:30,572][__main__][INFO] - Iteration 188 took 22s (37.70% Gen, 58.14% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 47m 53s. Estimated total time: 18h 54m 35s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 49s, 500 more iterations: 3h 9m 5s.
[2025-11-13 09:10:30,574][__main__][INFO] - Starting iteration 188.
[2025-11-13 09:10:30,577][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:10:30,577][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:10:39,594][__main__][INFO] - Number of regex retries in iteration 188: 0
[2025-11-13 09:10:39,594][__main__][INFO] - agents played in iteration 188 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:10:40,052][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:10:40,089][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:10:40,122][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:10:40,155][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:10:40,155][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:10:40,156][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:10:40,901][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:10:41,200][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:10:41,526][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:10:41,851][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:10:42,177][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:10:42,504][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:10:42,834][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:10:43,159][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:10:43,488][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:10:43,814][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:10:44,139][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:10:44,466][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:10:44,792][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:10:45,121][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:10:45,446][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:10:45,775][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:10:46,102][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:10:46,427][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:10:46,753][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:10:47,079][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:10:47,407][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:10:47,732][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:10:48,059][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:10:48,386][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:10:48,712][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:10:49,039][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:10:49,366][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:10:49,691][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:10:50,019][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:10:50,345][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:10:50,670][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:10:50,999][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:10:51,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:10:52,043][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:10:52,794][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:10:52,795][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:10:52,797][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:10:53,769][__main__][INFO] - Iteration 189 took 23s (38.88% Gen, 56.92% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 12m 35s. Estimated total time: 19h 19m 40s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 39s, 500 more iterations: 3h 13m 16s.
[2025-11-13 09:10:53,771][__main__][INFO] - Starting iteration 189.
[2025-11-13 09:10:53,774][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:10:53,775][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:11:02,753][__main__][INFO] - Number of regex retries in iteration 189: 0
[2025-11-13 09:11:02,754][__main__][INFO] - agents played in iteration 189 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:11:03,214][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:03,249][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:03,283][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:03,318][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:03,318][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:11:03,318][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:11:04,039][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:11:04,336][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:11:04,666][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:11:04,992][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:11:05,318][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:11:05,646][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:11:05,974][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:11:06,299][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:11:06,625][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:11:06,950][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:11:07,277][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:11:07,603][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:11:07,929][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:11:08,257][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:11:08,583][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:11:08,910][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:11:09,236][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:11:09,563][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:11:09,891][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:11:10,217][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:11:10,543][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:11:10,868][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:11:11,193][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:11:11,520][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:11:11,848][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:11:12,172][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:11:12,498][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:11:12,823][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:11:13,149][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:11:13,475][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:11:13,800][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:11:14,126][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:11:14,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:11:15,168][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:11:15,923][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:11:15,924][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:11:15,926][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:11:16,872][__main__][INFO] - Iteration 190 took 23s (38.88% Gen, 57.02% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 7m 28s. Estimated total time: 19h 14m 55s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 29s, 500 more iterations: 3h 12m 29s.
[2025-11-13 09:11:16,874][__main__][INFO] - Starting iteration 190.
[2025-11-13 09:11:16,876][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:11:16,877][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:11:25,786][__main__][INFO] - Number of regex retries in iteration 190: 0
[2025-11-13 09:11:25,787][__main__][INFO] - agents played in iteration 190 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:11:26,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:26,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:26,311][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:26,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:26,345][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:11:26,346][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:11:27,063][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:11:27,361][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:11:27,692][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:11:28,019][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:11:28,345][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:11:28,672][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:11:28,999][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:11:29,328][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:11:29,657][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:11:29,983][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:11:30,309][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:11:30,635][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:11:30,962][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:11:31,288][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:11:31,614][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:11:31,941][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:11:32,269][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:11:32,595][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:11:32,923][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:11:33,249][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:11:33,580][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:11:33,912][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:11:34,238][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:11:34,563][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:11:34,888][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:11:35,213][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:11:35,538][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:11:35,865][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:11:36,192][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:11:36,519][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:11:36,844][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:11:37,171][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:11:37,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:11:38,205][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:11:38,944][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:11:38,945][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:11:38,947][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:11:40,877][__main__][INFO] - Iteration 191 took 24s (37.12% Gen, 54.83% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 52m 13s. Estimated total time: 20h 0m 4s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 0s, 500 more iterations: 3h 20m 0s.
[2025-11-13 09:11:40,879][__main__][INFO] - Starting iteration 191.
[2025-11-13 09:11:40,882][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:11:40,882][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:11:49,286][__main__][INFO] - Number of regex retries in iteration 191: 0
[2025-11-13 09:11:49,287][__main__][INFO] - agents played in iteration 191 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:11:49,737][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:49,771][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:49,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:49,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:49,839][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:11:49,840][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:11:50,562][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:11:50,858][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:11:51,188][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:11:51,514][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:11:51,845][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:11:52,172][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:11:52,498][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:11:52,822][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:11:53,151][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:11:53,477][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:11:53,805][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:11:54,129][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:11:54,455][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:11:54,782][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:11:55,110][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:11:55,436][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:11:55,764][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:11:56,098][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:11:56,429][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:11:56,754][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:11:57,079][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:11:57,407][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:11:57,734][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:11:58,058][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:11:58,387][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:11:58,714][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:11:59,041][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:11:59,367][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:11:59,693][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:12:00,018][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:12:00,345][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:12:00,674][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:12:01,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:12:01,700][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:12:02,428][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:12:02,429][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:12:02,431][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:12:03,392][__main__][INFO] - Iteration 192 took 22s (37.33% Gen, 58.39% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 37m 20s. Estimated total time: 18h 45m 34s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 31s, 500 more iterations: 3h 7m 35s.
[2025-11-13 09:12:03,395][__main__][INFO] - Starting iteration 192.
[2025-11-13 09:12:03,398][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:12:03,398][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:12:12,325][__main__][INFO] - Number of regex retries in iteration 192: 0
[2025-11-13 09:12:12,325][__main__][INFO] - agents played in iteration 192 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:12:12,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:12,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:12,846][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:12,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:12,880][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:12:12,880][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
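The "Get advantages" and "Sharing advantage alignment data" lines suggest each agent computes per-timestep advantages against its critic's value baseline and then exchanges them with the other agent's trainer. A generic discounted-return advantage sketch is shown below; this is an assumed form for illustration, and the actual trainer_ad_align computation may differ:

```python
def advantages(rewards, values, gamma=0.96):
    """A_t = G_t - V(s_t), where G_t is the discounted return-to-go.
    rewards: per-timestep rewards; values: critic estimates V(s_t)."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return [g - v for g, v in zip(returns, values)]
```

In an advantage-alignment setup these per-timestep lists would be the payload of the "advantage packets" each trainer sends and receives.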
[2025-11-13 09:12:13,607][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:12:13,905][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:12:14,232][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:12:14,561][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:12:14,889][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:12:15,215][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:12:15,540][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:12:15,867][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:12:16,193][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:12:16,517][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:12:16,849][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:12:17,174][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:12:17,499][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:12:17,826][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:12:18,155][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:12:18,481][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:12:18,808][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:12:19,135][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:12:19,462][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:12:19,792][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:12:20,117][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:12:20,445][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:12:20,775][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:12:21,100][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:12:21,431][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:12:21,756][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:12:22,085][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:12:22,411][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:12:22,736][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:12:23,064][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:12:23,390][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:12:23,717][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:12:24,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:12:24,737][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:12:25,479][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:12:25,481][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:12:25,482][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:12:26,445][__main__][INFO] - Iteration 193 took 23s (38.73% Gen, 57.09% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 3m 46s. Estimated total time: 19h 12m 24s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 24s, 500 more iterations: 3h 12m 4s.
[2025-11-13 09:12:26,447][__main__][INFO] - Starting iteration 193.
[2025-11-13 09:12:26,451][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:12:26,451][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:12:35,267][__main__][INFO] - Number of regex retries in iteration 193: 0
[2025-11-13 09:12:35,267][__main__][INFO] - agents played in iteration 193 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:12:35,723][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:35,757][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:35,791][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:35,825][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:35,825][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:12:35,826][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:12:36,550][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:12:36,849][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:12:37,176][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:12:37,502][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:12:37,827][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:12:38,152][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:12:38,485][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:12:38,814][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:12:39,141][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:12:39,468][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:12:39,798][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:12:40,126][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:12:40,452][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:12:40,784][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:12:41,115][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:12:41,441][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:12:41,767][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:12:42,093][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:12:42,416][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:12:42,744][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:12:43,072][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:12:43,404][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:12:43,733][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:12:44,060][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:12:44,387][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:12:44,719][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:12:45,048][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:12:45,375][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:12:45,700][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:12:46,026][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:12:46,353][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:12:46,677][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:12:47,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:12:47,718][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:12:48,454][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:12:48,455][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:12:48,457][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:12:49,466][__main__][INFO] - Iteration 194 took 23s (38.30% Gen, 57.31% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 1m 46s. Estimated total time: 19h 10m 47s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 21s, 500 more iterations: 3h 11m 47s.
[2025-11-13 09:12:49,468][__main__][INFO] - Starting iteration 194.
[2025-11-13 09:12:49,471][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:12:49,472][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:12:58,060][__main__][INFO] - Number of regex retries in iteration 194: 0
[2025-11-13 09:12:58,060][__main__][INFO] - agents played in iteration 194 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:12:58,501][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:58,537][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:58,571][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:58,605][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:58,605][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:12:58,606][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:12:59,335][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:12:59,632][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:12:59,959][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:13:00,285][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:13:00,610][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:13:00,937][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:13:01,262][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:13:01,589][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:13:01,914][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:13:02,239][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:13:02,565][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:13:02,890][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:13:03,215][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:13:03,543][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:13:03,869][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:13:04,194][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:13:04,519][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:13:04,846][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:13:05,171][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:13:05,496][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:13:05,823][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:13:06,149][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:13:06,476][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:13:06,803][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:13:07,131][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:13:07,457][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:13:07,784][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:13:08,111][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:13:08,436][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:13:08,764][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:13:09,090][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:13:09,415][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:13:09,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:13:10,418][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:13:11,152][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:13:11,154][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:13:11,156][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:13:12,144][__main__][INFO] - Iteration 195 took 22s (37.88% Gen, 57.76% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 44m 17s. Estimated total time: 18h 53m 40s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 47s, 500 more iterations: 3h 8m 56s.
[2025-11-13 09:13:12,146][__main__][INFO] - Starting iteration 195.
[2025-11-13 09:13:12,150][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:13:12,150][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:13:21,237][__main__][INFO] - Number of regex retries in iteration 195: 0
[2025-11-13 09:13:21,237][__main__][INFO] - agents played in iteration 195 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:13:21,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:13:21,726][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:13:21,759][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:13:21,793][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:13:21,793][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:13:21,794][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:13:22,520][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:13:22,818][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:13:23,146][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:13:23,472][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:13:23,798][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:13:24,124][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:13:24,451][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:13:24,780][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:13:25,108][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:13:25,435][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:13:25,762][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:13:26,089][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:13:26,415][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:13:26,745][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:13:27,070][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:13:27,399][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:13:27,724][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:13:28,050][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:13:28,376][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:13:28,701][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:13:29,027][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:13:29,354][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:13:29,679][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:13:30,005][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:13:30,333][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:13:30,660][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:13:30,986][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:13:31,312][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:13:31,637][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:13:31,963][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:13:32,290][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:13:32,615][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:13:32,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:13:33,621][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:13:34,351][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:13:34,353][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:13:34,354][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:13:35,348][__main__][INFO] - Iteration 196 took 23s (39.17% Gen, 56.54% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 10m 12s. Estimated total time: 19h 19m 59s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 39s, 500 more iterations: 3h 13m 19s.
[2025-11-13 09:13:35,351][__main__][INFO] - Starting iteration 196.
[2025-11-13 09:13:35,354][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:13:35,354][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:13:44,377][__main__][INFO] - Number of regex retries in iteration 196: 0
[2025-11-13 09:13:44,378][__main__][INFO] - agents played in iteration 196 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:13:44,824][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:13:44,857][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:13:44,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:13:44,924][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:13:44,925][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:13:44,925][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:13:45,652][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:13:45,950][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:13:46,282][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:13:46,611][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:13:46,940][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:13:47,265][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:13:47,590][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:13:47,917][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:13:48,242][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:13:48,568][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:13:48,895][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:13:49,220][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:13:49,549][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:13:49,875][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:13:50,200][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:13:50,526][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:13:50,852][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:13:51,177][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:13:51,503][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:13:51,828][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:13:52,155][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:13:52,482][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:13:52,809][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:13:53,134][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:13:53,460][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:13:53,787][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:13:54,114][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:13:54,443][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:13:54,767][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:13:55,093][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:13:55,420][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:13:55,754][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:13:56,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:13:56,768][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:13:57,508][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:13:57,509][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:13:57,511][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:13:58,532][__main__][INFO] - Iteration 197 took 23s (38.93% Gen, 56.66% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 8m 48s. Estimated total time: 19h 18m 58s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 37s, 500 more iterations: 3h 13m 9s.
[2025-11-13 09:13:58,534][__main__][INFO] - Starting iteration 197.
[2025-11-13 09:13:58,538][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:13:58,538][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:14:07,385][__main__][INFO] - Number of regex retries in iteration 197: 0
[2025-11-13 09:14:07,386][__main__][INFO] - agents played in iteration 197 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:14:07,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:07,871][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:07,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:07,939][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:07,939][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:14:07,940][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:14:08,678][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:14:08,976][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:14:09,305][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:14:09,631][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:14:09,958][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:14:10,284][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:14:10,614][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:14:10,938][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:14:11,266][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:14:11,591][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:14:11,918][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:14:12,246][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:14:12,572][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:14:12,897][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:14:13,223][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:14:13,549][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:14:13,876][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:14:14,202][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:14:14,527][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:14:14,852][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:14:15,177][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:14:15,504][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:14:15,832][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:14:16,158][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:14:16,483][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:14:16,808][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:14:17,134][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:14:17,460][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:14:17,785][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:14:18,111][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:14:18,437][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:14:18,763][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:14:19,088][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:14:19,778][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:14:20,525][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:14:20,527][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:14:20,529][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:14:21,500][__main__][INFO] - Iteration 198 took 22s (38.53% Gen, 57.24% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 57m 35s. Estimated total time: 19h 8m 7s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 16s, 500 more iterations: 3h 11m 21s.
[2025-11-13 09:14:21,501][__main__][INFO] - Starting iteration 198.
[2025-11-13 09:14:21,505][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:14:21,506][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:14:30,740][__main__][INFO] - Number of regex retries in iteration 198: 0
[2025-11-13 09:14:30,741][__main__][INFO] - agents played in iteration 198 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:14:31,195][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:31,228][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:31,262][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:31,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:31,296][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:14:31,297][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:14:32,026][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:14:32,323][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:14:32,652][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:14:32,977][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:14:33,304][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:14:33,630][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:14:33,955][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:14:34,284][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:14:34,609][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:14:34,936][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:14:35,261][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:14:35,587][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:14:35,912][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:14:36,237][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:14:36,564][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:14:36,889][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:14:37,215][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:14:37,542][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:14:37,868][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:14:38,194][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:14:38,518][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:14:38,847][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:14:39,174][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:14:39,501][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:14:39,826][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:14:40,152][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:14:40,477][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:14:40,803][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:14:41,128][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:14:41,455][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:14:41,780][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:14:42,108][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:14:42,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:14:43,118][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:14:43,844][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:14:43,845][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:14:43,847][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:14:44,827][__main__][INFO] - Iteration 199 took 23s (39.60% Gen, 56.19% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 15m 12s. Estimated total time: 19h 26m 8s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 52s, 500 more iterations: 3h 14m 21s.
[2025-11-13 09:14:44,829][__main__][INFO] - Starting iteration 199.
[2025-11-13 09:14:44,832][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:14:44,833][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:14:53,818][__main__][INFO] - Number of regex retries in iteration 199: 0
[2025-11-13 09:14:53,819][__main__][INFO] - agents played in iteration 199 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:14:54,281][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:54,314][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:54,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:54,381][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:54,382][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:14:54,382][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:14:55,127][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:14:55,425][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:14:55,752][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:14:56,080][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:14:56,405][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:14:56,733][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:14:57,058][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:14:57,388][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:14:57,714][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:14:58,042][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:14:58,369][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:14:58,697][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:14:59,023][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:14:59,351][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:14:59,679][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:15:00,005][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:15:00,332][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:15:00,658][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:15:00,983][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:15:01,311][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:15:01,636][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:15:01,962][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:15:02,290][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:15:02,614][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:15:02,941][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:15:03,266][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:15:03,593][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:15:03,918][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:15:04,246][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:15:04,570][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:15:04,896][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:15:05,221][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:15:05,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:15:06,228][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:15:06,970][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:15:06,972][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:15:06,973][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:15:07,975][__main__][INFO] - Iteration 200 took 23s (38.83% Gen, 56.84% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 5m 52s. Estimated total time: 19h 17m 11s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 34s, 500 more iterations: 3h 12m 51s.
[2025-11-13 09:15:07,977][__main__][INFO] - Starting iteration 200.
[2025-11-13 09:15:07,981][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:15:07,981][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:15:17,117][__main__][INFO] - Number of regex retries in iteration 200: 0
[2025-11-13 09:15:17,117][__main__][INFO] - agents played in iteration 200 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:15:17,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:17,609][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:17,643][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:17,677][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:17,677][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:15:17,678][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:15:18,406][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:15:18,702][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:15:19,031][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:15:19,358][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:15:19,686][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:15:20,011][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:15:20,337][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:15:20,663][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:15:20,990][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:15:21,317][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:15:21,642][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:15:21,969][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:15:22,295][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:15:22,620][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:15:22,946][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:15:23,271][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:15:23,599][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:15:23,924][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:15:24,249][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:15:24,576][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:15:24,901][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:15:25,229][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:15:25,555][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:15:25,881][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:15:26,207][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:15:26,533][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:15:26,858][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:15:27,183][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:15:27,508][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:15:27,837][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:15:28,163][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:15:28,489][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:15:28,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:15:29,703][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:15:30,439][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:15:30,441][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:15:30,443][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:15:32,361][__main__][INFO] - Iteration 201 took 24s (37.47% Gen, 54.66% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 7m 21s. Estimated total time: 20h 19m 4s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 38s, 500 more iterations: 3h 23m 10s.
[2025-11-13 09:15:32,363][__main__][INFO] - Starting iteration 201.
[2025-11-13 09:15:32,367][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:15:32,367][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:15:41,897][__main__][INFO] - Number of regex retries in iteration 201: 0
[2025-11-13 09:15:41,898][__main__][INFO] - agents played in iteration 201 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:15:42,354][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:42,387][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:42,420][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:42,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:42,455][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:15:42,455][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:15:43,189][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:15:43,486][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:15:43,814][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:15:44,139][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:15:44,467][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:15:44,796][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:15:45,122][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:15:45,450][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:15:45,779][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:15:46,104][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:15:46,431][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:15:46,762][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:15:47,089][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:15:47,414][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:15:47,739][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:15:48,068][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:15:48,394][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:15:48,724][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:15:49,052][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:15:49,378][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:15:49,704][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:15:50,029][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:15:50,356][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:15:50,682][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:15:51,011][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:15:51,338][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:15:51,664][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:15:51,991][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:15:52,319][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:15:52,651][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:15:52,977][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:15:53,303][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:15:53,629][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:15:54,342][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:15:55,097][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:15:55,098][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:15:55,100][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:15:56,120][__main__][INFO] - Iteration 202 took 23s (40.12% Gen, 55.58% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 35m 35s. Estimated total time: 19h 47m 42s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 35s, 500 more iterations: 3h 17m 57s.
[2025-11-13 09:15:56,123][__main__][INFO] - Starting iteration 202.
[2025-11-13 09:15:56,126][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:15:56,127][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:16:05,547][__main__][INFO] - Number of regex retries in iteration 202: 0
[2025-11-13 09:16:05,548][__main__][INFO] - agents played in iteration 202 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:16:06,005][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:06,039][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:06,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:06,105][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:06,106][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:16:06,106][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:16:06,829][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:16:07,125][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:16:07,451][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:16:07,782][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:16:08,109][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:16:08,433][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:16:08,760][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:16:09,085][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:16:09,411][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:16:09,736][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:16:10,060][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:16:10,389][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:16:10,720][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:16:11,047][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:16:11,373][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:16:11,699][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:16:12,027][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:16:12,354][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:16:12,680][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:16:13,006][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:16:13,331][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:16:13,658][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:16:13,986][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:16:14,316][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:16:14,642][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:16:14,968][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:16:15,293][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:16:15,620][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:16:15,947][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:16:16,273][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:16:16,603][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:16:16,934][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:16:17,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:16:17,992][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:16:18,751][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:16:18,752][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:16:18,755][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:16:19,704][__main__][INFO] - Iteration 203 took 23s (39.95% Gen, 56.01% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 26m 26s. Estimated total time: 19h 38m 56s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 17s, 500 more iterations: 3h 16m 29s.
[2025-11-13 09:16:19,706][__main__][INFO] - Starting iteration 203.
[2025-11-13 09:16:19,709][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:16:19,709][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:16:28,783][__main__][INFO] - Number of regex retries in iteration 203: 0 [2025-11-13 09:16:28,784][__main__][INFO] - agents played in iteration 203 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:16:29,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:16:29,294][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:16:29,327][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:16:29,361][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:16:29,362][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:16:29,362][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:16:30,109][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:16:30,410][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:16:30,736][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:16:31,063][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:16:31,395][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:16:31,720][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:16:32,046][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:16:32,373][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:16:32,700][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:16:33,027][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:16:33,353][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:16:33,681][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:16:34,008][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:16:34,335][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:16:34,662][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:16:34,987][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:16:35,313][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:16:35,639][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:16:35,965][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:16:36,291][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:16:36,620][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:16:36,948][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:16:37,278][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:16:37,605][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:16:37,933][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:16:38,259][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:16:38,588][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:16:38,916][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:16:39,245][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:16:39,572][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:16:39,898][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:16:40,225][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:16:40,554][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:16:41,279][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:16:42,019][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:16:42,021][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:16:42,023][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:16:43,000][__main__][INFO] - Iteration 204 took 23s (38.96% Gen, 56.84% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 11m 41s. Estimated total time: 19h 24m 35s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 49s, 500 more iterations: 3h 14m 5s. [2025-11-13 09:16:43,002][__main__][INFO] - Starting iteration 204. [2025-11-13 09:16:43,005][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1. 
[2025-11-13 09:16:43,006][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:16:52,529][__main__][INFO] - Number of regex retries in iteration 204: 0 [2025-11-13 09:16:52,530][__main__][INFO] - agents played in iteration 204 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:16:52,984][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:16:53,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:16:53,054][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:16:53,088][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:16:53,089][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:16:53,089][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:16:53,809][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:16:54,103][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:16:54,429][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:16:54,756][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:16:55,081][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:16:55,409][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:16:55,736][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:16:56,061][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:16:56,386][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:16:56,712][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:16:57,036][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:16:57,363][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:16:57,688][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:16:58,014][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:16:58,342][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:16:58,668][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:16:58,997][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:16:59,323][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:16:59,648][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:16:59,973][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:17:00,298][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:17:00,625][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:17:00,951][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:17:01,276][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:17:01,602][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:17:01,926][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:17:02,252][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:17:02,577][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:17:02,903][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:17:03,228][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:17:03,554][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:17:03,879][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:17:04,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:17:04,919][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:17:05,663][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:17:05,665][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:17:05,666][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:17:06,644][__main__][INFO] - Iteration 205 took 23s (40.29% Gen, 55.57% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 28m 40s. Estimated total time: 19h 41m 58s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 23s, 500 more iterations: 3h 16m 59s. [2025-11-13 09:17:06,646][__main__][INFO] - Starting iteration 205. [2025-11-13 09:17:06,649][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1. 
[2025-11-13 09:17:06,649][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:17:16,052][__main__][INFO] - Number of regex retries in iteration 205: 0 [2025-11-13 09:17:16,053][__main__][INFO] - agents played in iteration 205 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:17:16,513][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:17:16,548][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:17:16,582][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:17:16,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:17:16,615][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:17:16,616][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:17:17,355][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:17:17,652][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:17:17,978][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:17:18,304][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:17:18,628][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:17:18,953][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:17:19,281][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:17:19,607][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:17:19,932][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:17:20,258][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:17:20,583][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:17:20,908][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:17:21,232][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:17:21,558][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:17:21,885][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:17:22,211][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:17:22,541][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:17:22,868][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:17:23,193][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:17:23,519][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:17:23,851][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:17:24,177][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:17:24,502][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:17:24,827][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:17:25,153][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:17:25,479][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:17:25,805][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:17:26,130][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:17:26,457][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:17:26,786][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:17:27,110][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:17:27,435][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:17:27,761][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:17:28,491][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:17:29,241][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:17:29,242][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:17:29,245][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:17:30,239][__main__][INFO] - Iteration 206 took 23s (39.86% Gen, 55.92% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 25m 52s. Estimated total time: 19h 39m 33s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 19s, 500 more iterations: 3h 16m 35s. [2025-11-13 09:17:30,241][__main__][INFO] - Starting iteration 206. [2025-11-13 09:17:30,245][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1. 
[2025-11-13 09:17:30,245][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:17:39,811][__main__][INFO] - Number of regex retries in iteration 206: 0 [2025-11-13 09:17:39,812][__main__][INFO] - agents played in iteration 206 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:17:40,265][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:17:40,302][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:17:40,336][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:17:40,370][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:17:40,371][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:17:40,372][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:17:41,097][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:17:41,393][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:17:41,722][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:17:42,052][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:17:42,382][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:17:42,710][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:17:43,038][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:17:43,363][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:17:43,690][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:17:44,015][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:17:44,342][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:17:44,668][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:17:44,999][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:17:45,327][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:17:45,653][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:17:45,978][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:17:46,306][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:17:46,633][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:17:46,960][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:17:47,287][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:17:47,613][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:17:47,938][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:17:48,264][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:17:48,589][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:17:48,916][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:17:49,243][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:17:49,570][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:17:49,895][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:17:50,221][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:17:50,548][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:17:50,874][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:17:51,199][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:17:51,527][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:17:52,249][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:17:52,983][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:17:52,984][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:17:52,986][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:17:53,940][__main__][INFO] - Iteration 207 took 23s (40.37% Gen, 55.60% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 30m 44s. Estimated total time: 19h 44m 49s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 29s, 500 more iterations: 3h 17m 28s. [2025-11-13 09:17:53,942][__main__][INFO] - Starting iteration 207. [2025-11-13 09:17:53,944][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1. 
[2025-11-13 09:17:53,945][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:18:02,810][__main__][INFO] - Number of regex retries in iteration 207: 0 [2025-11-13 09:18:02,811][__main__][INFO] - agents played in iteration 207 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:18:03,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:18:03,302][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:18:03,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:18:03,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:18:03,369][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:18:03,370][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:18:04,097][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:18:04,396][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:18:04,723][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:18:05,048][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:18:05,374][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:18:05,700][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:18:06,027][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:18:06,353][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:18:06,681][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:18:07,008][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:18:07,336][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:18:07,661][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:18:07,987][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:18:08,312][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:18:08,637][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:18:08,963][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:18:09,290][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:18:09,615][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:18:09,942][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:18:10,268][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:18:10,593][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:18:10,919][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:18:11,245][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:18:11,570][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:18:11,896][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:18:12,225][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:18:12,551][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:18:12,880][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:18:13,205][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:18:13,530][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:18:13,856][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:18:14,181][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:18:14,508][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:18:15,225][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:18:15,957][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:18:15,958][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:18:15,960][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:18:16,938][__main__][INFO] - Iteration 208 took 22s (38.56% Gen, 57.18% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 55m 15s. Estimated total time: 19h 9m 43s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 19s, 500 more iterations: 3h 11m 37s. [2025-11-13 09:18:16,940][__main__][INFO] - Starting iteration 208. [2025-11-13 09:18:16,943][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1. 
[2025-11-13 09:18:16,943][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:18:26,489][__main__][INFO] - Number of regex retries in iteration 208: 0 [2025-11-13 09:18:26,490][__main__][INFO] - agents played in iteration 208 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:18:26,954][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:18:26,990][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:18:27,025][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:18:27,058][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:18:27,059][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:18:27,059][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:18:27,780][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:18:28,078][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:18:28,405][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:18:28,733][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:18:29,062][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:18:29,388][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:18:29,713][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:18:30,040][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:18:30,367][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:18:30,694][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:18:31,021][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:18:31,351][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:18:31,676][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:18:32,003][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:18:32,329][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:18:32,655][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:18:32,982][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:18:33,310][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:18:33,633][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:18:33,962][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:18:34,288][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:18:34,614][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:18:34,941][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:18:35,266][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:18:35,593][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:18:35,919][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:18:36,247][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:18:36,576][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:18:36,903][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:18:37,229][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:18:37,556][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:18:37,884][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:18:38,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:18:38,926][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:18:39,666][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:18:39,668][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:18:39,670][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:18:40,611][__main__][INFO] - Iteration 209 took 23s (40.33% Gen, 55.69% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 28m 34s. Estimated total time: 19h 43m 25s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 26s, 500 more iterations: 3h 17m 14s.
[2025-11-13 09:18:40,613][__main__][INFO] - Starting iteration 209.
[2025-11-13 09:18:40,616][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:18:40,616][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:18:50,141][__main__][INFO] - Number of regex retries in iteration 209: 0
[2025-11-13 09:18:50,142][__main__][INFO] - agents played in iteration 209 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:18:50,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:50,662][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:50,696][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:50,730][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:50,731][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:18:50,731][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:18:51,447][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:18:51,744][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:18:52,070][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:18:52,397][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:18:52,726][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:18:53,051][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:18:53,379][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:18:53,705][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:18:54,031][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:18:54,356][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:18:54,684][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:18:55,009][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:18:55,334][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:18:55,661][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:18:55,988][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:18:56,314][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:18:56,639][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:18:56,966][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:18:57,293][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:18:57,618][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:18:57,944][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:18:58,270][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:18:58,597][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:18:58,923][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:18:59,253][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:18:59,580][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:18:59,904][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:19:00,230][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:19:00,556][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:19:00,882][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:19:01,209][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:19:01,535][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:19:01,861][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:19:02,580][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:19:03,321][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:19:03,323][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:19:03,324][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:19:04,256][__main__][INFO] - Iteration 210 took 23s (40.29% Gen, 55.76% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 26m 47s. Estimated total time: 19h 42m 2s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 24s, 500 more iterations: 3h 17m 0s.
[2025-11-13 09:19:04,259][__main__][INFO] - Starting iteration 210.
[2025-11-13 09:19:04,261][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:19:04,262][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:19:13,729][__main__][INFO] - Number of regex retries in iteration 210: 0
[2025-11-13 09:19:13,730][__main__][INFO] - agents played in iteration 210 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:19:14,184][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:14,220][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:14,254][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:14,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:14,289][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:19:14,289][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:19:15,001][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:19:15,298][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:19:15,626][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:19:15,954][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:19:16,281][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:19:16,608][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:19:16,932][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:19:17,261][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:19:17,593][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:19:17,921][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:19:18,246][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:19:18,571][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:19:18,896][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:19:19,228][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:19:19,563][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:19:19,888][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:19:20,213][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:19:20,537][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:19:20,863][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:19:21,188][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:19:21,516][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:19:21,844][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:19:22,169][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:19:22,496][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:19:22,820][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:19:23,149][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:19:23,474][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:19:23,801][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:19:24,128][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:19:24,453][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:19:24,778][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:19:25,107][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:19:25,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:19:26,150][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:19:26,871][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:19:26,873][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:19:26,874][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:19:28,749][__main__][INFO] - Iteration 211 took 24s (38.66% Gen, 53.68% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 8m 45s. Estimated total time: 20h 24m 24s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 48s, 500 more iterations: 3h 24m 4s.
[2025-11-13 09:19:28,750][__main__][INFO] - Starting iteration 211.
[2025-11-13 09:19:28,753][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:19:28,754][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:19:38,560][__main__][INFO] - Number of regex retries in iteration 211: 0
[2025-11-13 09:19:38,561][__main__][INFO] - agents played in iteration 211 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:19:39,019][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:39,053][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:39,087][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:39,120][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:39,121][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:19:39,121][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:19:39,821][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:19:40,118][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:19:40,447][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:19:40,773][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:19:41,101][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:19:41,430][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:19:41,756][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:19:42,079][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:19:42,405][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:19:42,736][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:19:43,061][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:19:43,387][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:19:43,714][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:19:44,039][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:19:44,367][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:19:44,692][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:19:45,016][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:19:45,342][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:19:45,667][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:19:45,992][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:19:46,318][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:19:46,644][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:19:46,975][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:19:47,301][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:19:47,628][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:19:47,955][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:19:48,286][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:19:48,611][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:19:48,937][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:19:49,264][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:19:49,594][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:19:49,920][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:19:50,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:19:50,964][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:19:51,701][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:19:51,703][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:19:51,706][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:19:52,578][__main__][INFO] - Iteration 212 took 23s (41.16% Gen, 55.17% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 35m 13s. Estimated total time: 19h 51m 17s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 42s, 500 more iterations: 3h 18m 32s.
[2025-11-13 09:19:52,580][__main__][INFO] - Starting iteration 212.
[2025-11-13 09:19:52,583][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:19:52,583][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:20:02,259][__main__][INFO] - Number of regex retries in iteration 212: 0
[2025-11-13 09:20:02,260][__main__][INFO] - agents played in iteration 212 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:20:02,755][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:02,792][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:02,825][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:02,859][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:02,860][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:20:02,860][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:20:03,559][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:20:03,854][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:20:04,181][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:20:04,507][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:20:04,835][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:20:05,166][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:20:05,490][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:20:05,814][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:20:06,139][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:20:06,465][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:20:06,790][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:20:07,116][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:20:07,442][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:20:07,768][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:20:08,093][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:20:08,418][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:20:08,745][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:20:09,071][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:20:09,396][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:20:09,722][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:20:10,048][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:20:10,373][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:20:10,698][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:20:11,023][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:20:11,350][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:20:11,675][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:20:12,002][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:20:12,328][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:20:12,656][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:20:12,981][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:20:13,307][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:20:13,634][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:20:13,960][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:20:14,681][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:20:15,400][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:20:15,403][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:20:15,405][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:20:16,307][__main__][INFO] - Iteration 213 took 23s (40.78% Gen, 55.41% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 29m 47s. Estimated total time: 19h 46m 14s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 32s, 500 more iterations: 3h 17m 42s.
[2025-11-13 09:20:16,308][__main__][INFO] - Starting iteration 213.
[2025-11-13 09:20:16,311][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:20:16,312][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:20:25,682][__main__][INFO] - Number of regex retries in iteration 213: 0
[2025-11-13 09:20:25,683][__main__][INFO] - agents played in iteration 213 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:20:26,142][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:26,176][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:26,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:26,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:26,244][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:20:26,244][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:20:26,978][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:20:27,275][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:20:27,602][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:20:27,929][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:20:28,255][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:20:28,581][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:20:28,908][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:20:29,242][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:20:29,573][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:20:29,900][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:20:30,225][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:20:30,552][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:20:30,881][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:20:31,209][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:20:31,533][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:20:31,861][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:20:32,187][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:20:32,512][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:20:32,837][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:20:33,164][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:20:33,489][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:20:33,815][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:20:34,141][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:20:34,466][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:20:34,793][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:20:35,119][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:20:35,447][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:20:35,774][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:20:36,100][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:20:36,427][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:20:36,756][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:20:37,082][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:20:37,409][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:20:38,131][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:20:38,870][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:20:38,871][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:20:38,873][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:20:39,799][__main__][INFO] - Iteration 214 took 23s (39.90% Gen, 56.16% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 17m 35s. Estimated total time: 19h 34m 25s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 8s, 500 more iterations: 3h 15m 44s.
[2025-11-13 09:20:39,801][__main__][INFO] - Starting iteration 214.
[2025-11-13 09:20:39,805][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:20:39,805][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:20:49,431][__main__][INFO] - Number of regex retries in iteration 214: 0
[2025-11-13 09:20:49,432][__main__][INFO] - agents played in iteration 214 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:20:49,892][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:49,929][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:49,962][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:49,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:49,996][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:20:49,997][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:20:50,668][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:20:50,965][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:20:51,293][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:20:51,621][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:20:51,952][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:20:52,278][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:20:52,606][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:20:52,932][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:20:53,257][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:20:53,582][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:20:53,910][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:20:54,238][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:20:54,563][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:20:54,888][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:20:55,213][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:20:55,540][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:20:55,867][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:20:56,193][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:20:56,518][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:20:56,843][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:20:57,173][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:20:57,498][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:20:57,824][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:20:58,149][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:20:58,475][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:20:58,802][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:20:59,128][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:20:59,454][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:20:59,779][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:21:00,105][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:21:00,430][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:21:00,759][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:21:01,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:21:01,813][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:21:02,536][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:21:02,538][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:21:02,539][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:21:03,426][__main__][INFO] - Iteration 215 took 23s (40.76% Gen, 55.49% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 23m 53s. Estimated total time: 19h 41m 7s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 22s, 500 more iterations: 3h 16m 51s.
[2025-11-13 09:21:03,428][__main__][INFO] - Starting iteration 215.
[2025-11-13 09:21:03,431][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:21:03,431][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:21:12,650][__main__][INFO] - Number of regex retries in iteration 215: 0
[2025-11-13 09:21:12,651][__main__][INFO] - agents played in iteration 215 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:21:13,109][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:13,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:13,179][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:13,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:13,213][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:21:13,214][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:21:13,925][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:21:14,219][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:21:14,546][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:21:14,871][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:21:15,197][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:21:15,523][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:21:15,850][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:21:16,177][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:21:16,503][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:21:16,827][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:21:17,153][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:21:17,478][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:21:17,805][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:21:18,132][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:21:18,457][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:21:18,785][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:21:19,113][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:21:19,439][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:21:19,771][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:21:20,105][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:21:20,434][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:21:20,759][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:21:21,088][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:21:21,413][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:21:21,738][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:21:22,065][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:21:22,390][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:21:22,715][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:21:23,040][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:21:23,368][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:21:23,696][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:21:24,021][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:21:24,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:21:25,073][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:21:25,804][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:21:25,806][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:21:25,808][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:21:26,751][__main__][INFO] - Iteration 216 took 23s (39.53% Gen, 56.42% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 8m 27s. Estimated total time: 19h 26m 4s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 52s, 500 more iterations: 3h 14m 20s.
[2025-11-13 09:21:26,753][__main__][INFO] - Starting iteration 216.
[2025-11-13 09:21:26,756][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:21:26,757][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:21:36,530][__main__][INFO] - Number of regex retries in iteration 216: 0
[2025-11-13 09:21:36,531][__main__][INFO] - agents played in iteration 216 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:21:36,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:37,017][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:37,051][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:37,085][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:37,086][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:21:37,086][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:21:37,783][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:21:38,079][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:21:38,406][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:21:38,733][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:21:39,059][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:21:39,387][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:21:39,714][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:21:40,039][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:21:40,365][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:21:40,690][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:21:41,016][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:21:41,346][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:21:41,674][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:21:41,999][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:21:42,326][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:21:42,652][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:21:42,979][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:21:43,305][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:21:43,631][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:21:43,956][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:21:44,283][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:21:44,609][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:21:44,937][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:21:45,262][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:21:45,589][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:21:45,915][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:21:46,241][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:21:46,567][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:21:46,894][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:21:47,220][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:21:47,548][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:21:47,875][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:21:48,203][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:21:48,921][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:21:49,643][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:21:49,644][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:21:49,646][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:21:50,550][__main__][INFO] - Iteration 217 took 23s (41.07% Gen, 55.12% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 31m 45s. Estimated total time: 19h 49m 46s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 39s, 500 more iterations: 3h 18m 17s.
[2025-11-13 09:21:50,553][__main__][INFO] - Starting iteration 217.
[2025-11-13 09:21:50,555][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:21:50,555][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:21:59,915][__main__][INFO] - Number of regex retries in iteration 217: 0
[2025-11-13 09:21:59,916][__main__][INFO] - agents played in iteration 217 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:22:00,370][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:00,404][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:00,437][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:00,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:00,472][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:22:00,472][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:22:01,192][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:22:01,487][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:22:01,813][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:22:02,138][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:22:02,466][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:22:02,795][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:22:03,121][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:22:03,448][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:22:03,774][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:22:04,103][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:22:04,428][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:22:04,758][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:22:05,089][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:22:05,418][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:22:05,745][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:22:06,070][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:22:06,397][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:22:06,723][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:22:07,048][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:22:07,375][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:22:07,702][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:22:08,028][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:22:08,355][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:22:08,686][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:22:09,015][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:22:09,345][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:22:09,676][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:22:10,007][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:22:10,335][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:22:10,663][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:22:10,992][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:22:11,320][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:22:11,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:22:12,380][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:22:13,109][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:22:13,110][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:22:13,112][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:22:14,000][__main__][INFO] - Iteration 218 took 23s (39.92% Gen, 56.29% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 13m 52s. Estimated total time: 19h 32m 16s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 4s, 500 more iterations: 3h 15m 22s.
[2025-11-13 09:22:14,002][__main__][INFO] - Starting iteration 218.
[2025-11-13 09:22:14,005][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:22:14,005][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:22:23,410][__main__][INFO] - Number of regex retries in iteration 218: 0
[2025-11-13 09:22:23,410][__main__][INFO] - agents played in iteration 218 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:22:23,863][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:23,897][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:23,931][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:23,964][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:23,965][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:22:23,965][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:22:24,686][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:22:24,982][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:22:25,308][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:22:25,635][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:22:25,963][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:22:26,290][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:22:26,616][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:22:26,942][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:22:27,268][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:22:27,595][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:22:27,920][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:22:28,246][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:22:28,572][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:22:28,902][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:22:29,227][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:22:29,553][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:22:29,882][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:22:30,210][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:22:30,535][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:22:30,860][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:22:31,186][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:22:31,513][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:22:31,837][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:22:32,164][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:22:32,488][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:22:32,814][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:22:33,139][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:22:33,465][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:22:33,791][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:22:34,118][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:22:34,445][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:22:34,772][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:22:35,098][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:22:35,823][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:22:36,535][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:22:36,537][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:22:36,538][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:22:37,452][__main__][INFO] - Iteration 219 took 23s (40.11% Gen, 55.99% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 13m 35s. Estimated total time: 19h 32m 23s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 4s, 500 more iterations: 3h 15m 23s.
[2025-11-13 09:22:37,454][__main__][INFO] - Starting iteration 219.
[2025-11-13 09:22:37,457][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:22:37,457][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:22:46,774][__main__][INFO] - Number of regex retries in iteration 219: 0
[2025-11-13 09:22:46,775][__main__][INFO] - agents played in iteration 219 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:22:47,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:47,270][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:47,303][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:47,336][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:47,337][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:22:47,337][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:22:48,058][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:22:48,353][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:22:48,680][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:22:49,007][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:22:49,337][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:22:49,664][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:22:49,991][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:22:50,317][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:22:50,647][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:22:50,975][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:22:51,307][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:22:51,635][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:22:51,965][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:22:52,293][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:22:52,618][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:22:52,943][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:22:53,270][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:22:53,596][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:22:53,922][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:22:54,248][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:22:54,573][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:22:54,899][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:22:55,225][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:22:55,551][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:22:55,877][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:22:56,203][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:22:56,531][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:22:56,858][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:22:57,185][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:22:57,511][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:22:57,838][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:22:58,169][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:22:58,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
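The pattern in the entries above — 128 mini-batches processed with a progress message every 4, then a single accumulated-loss report before "Apply reinforce step" — is standard gradient accumulation for a REINFORCE-style update. A minimal, torch-free sketch of that loop structure follows; the function name and data layout are hypothetical, not the actual `trainer_common` implementation (note that 3840 accumulated tokens over 128 mini-batches would imply about 30 action tokens per mini-batch):

```python
def accumulate_policy_gradient(minibatches, log_every=4):
    """Hypothetical sketch: accumulate per-token REINFORCE terms
    across mini-batches before a single optimizer step.

    minibatches: list of mini-batches; each mini-batch is a list of
    (advantage, grad_log_prob) pairs, one pair per action token.
    """
    grad_accum = 0.0
    token_count = 0
    for i, mb in enumerate(minibatches):
        if i % log_every == 0:
            print(f"Processing mini-batch {i} of {len(minibatches)}")
        for advantage, grad_log_prob in mb:
            # REINFORCE contribution: advantage-weighted score function.
            grad_accum += advantage * grad_log_prob
            token_count += 1
    print(f"Accumulated the policy gradient loss for {token_count} tokens.")
    return grad_accum, token_count
```

With 128 mini-batches of 30 tokens each, `token_count` comes out to 3840, matching the log; only after the full pass would the optimizer apply one "reinforce step".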
[2025-11-13 09:22:59,240][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:22:59,934][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:22:59,936][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:22:59,938][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:23:00,816][__main__][INFO] - Iteration 220 took 23s (39.89% Gen, 56.35% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 8m 49s. Estimated total time: 19h 28m 1s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 56s, 500 more iterations: 3h 14m 40s.
[2025-11-13 09:23:00,818][__main__][INFO] - Starting iteration 220.
[2025-11-13 09:23:00,821][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
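The per-iteration summaries above ("Iteration 220 took 23s ... 10 more iterations: 3m 53s, ... 500 more iterations: 3h 14m 40s") are consistent with simple linear extrapolation of the average iteration time. A small sketch of that arithmetic, with hypothetical function names matching the log's "Xh Ym Zs" rendering:

```python
def fmt_duration(seconds):
    """Render a duration in the log's 'Xh Ym Zs' style, dropping
    leading zero units (e.g. 23 -> '23s', 233 -> '3m 53s')."""
    s = int(seconds)
    h, rem = divmod(s, 3600)
    m, s = divmod(rem, 60)
    if h:
        return f"{h}h {m}m {s}s"
    if m:
        return f"{m}m {s}s"
    return f"{s}s"

def eta(avg_iter_seconds, iterations_left):
    """Linear extrapolation: remaining time = average iteration
    duration times the number of iterations left."""
    return fmt_duration(avg_iter_seconds * iterations_left)
```

For instance, at roughly 23.3 s per iteration, `eta(23.3, 10)` gives "3m 53s", matching the log's 10-iteration estimate, and 18h 8m 49s corresponds to 65329 s of estimated remaining work.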
[2025-11-13 09:23:00,822][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:23:09,408][__main__][INFO] - Number of regex retries in iteration 220: 0
[2025-11-13 09:23:09,408][__main__][INFO] - agents played in iteration 220 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:23:09,859][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:09,895][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:09,928][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:09,961][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:09,962][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:23:09,962][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:23:10,669][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:23:10,967][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:23:11,294][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:23:11,622][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:23:11,949][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:23:12,274][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:23:12,600][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:23:12,929][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:23:13,256][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:23:13,583][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:23:13,908][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:23:14,234][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:23:14,559][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:23:14,886][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:23:15,213][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:23:15,539][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:23:15,866][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:23:16,195][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:23:16,520][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:23:16,847][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:23:17,175][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:23:17,504][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:23:17,831][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:23:18,156][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:23:18,482][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:23:18,808][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:23:19,134][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:23:19,462][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:23:19,789][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:23:20,115][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:23:20,441][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:23:20,769][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:23:21,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:23:21,820][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:23:22,553][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:23:22,555][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:23:22,556][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:23:24,328][__main__][INFO] - Iteration 221 took 23s (36.53% Gen, 55.93% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 15m 46s. Estimated total time: 19h 35m 21s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 10s, 500 more iterations: 3h 15m 53s.
[2025-11-13 09:23:24,329][__main__][INFO] - Starting iteration 221.
[2025-11-13 09:23:24,332][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:23:24,333][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:23:34,189][__main__][INFO] - Number of regex retries in iteration 221: 0
[2025-11-13 09:23:34,190][__main__][INFO] - agents played in iteration 221 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:23:34,642][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:34,675][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:34,708][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:34,741][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:34,741][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:23:34,742][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:23:35,455][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:23:35,752][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:23:36,081][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:23:36,408][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:23:36,735][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:23:37,061][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:23:37,388][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:23:37,714][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:23:38,039][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:23:38,364][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:23:38,689][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:23:39,017][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:23:39,343][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:23:39,670][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:23:39,994][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:23:40,319][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:23:40,645][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:23:40,975][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:23:41,304][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:23:41,632][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:23:41,961][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:23:42,288][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:23:42,615][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:23:42,943][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:23:43,271][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:23:43,597][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:23:43,924][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:23:44,253][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:23:44,581][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:23:44,906][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:23:45,235][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:23:45,561][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:23:45,888][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:23:46,635][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:23:47,337][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:23:47,338][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:23:47,340][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:23:48,254][__main__][INFO] - Iteration 222 took 23s (41.20% Gen, 54.97% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 36m 8s. Estimated total time: 19h 56m 7s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 52s, 500 more iterations: 3h 19m 21s.
[2025-11-13 09:23:48,256][__main__][INFO] - Starting iteration 222.
[2025-11-13 09:23:48,258][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:23:48,259][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:23:57,624][__main__][INFO] - Number of regex retries in iteration 222: 0
[2025-11-13 09:23:57,625][__main__][INFO] - agents played in iteration 222 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:23:58,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:58,111][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:58,145][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:58,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:58,179][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:23:58,180][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:23:58,885][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:23:59,183][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:23:59,509][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:23:59,835][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:24:00,160][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:24:00,486][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:24:00,813][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:24:01,136][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:24:01,462][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:24:01,788][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:24:02,113][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:24:02,438][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:24:02,765][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:24:03,091][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:24:03,418][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:24:03,746][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:24:04,072][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:24:04,398][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:24:04,727][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:24:05,052][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:24:05,379][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:24:05,704][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:24:06,030][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:24:06,356][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:24:06,682][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:24:07,007][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:24:07,332][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:24:07,658][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:24:07,983][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:24:08,313][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:24:08,639][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:24:08,965][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:24:09,293][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:24:10,027][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:24:10,719][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:24:10,721][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:24:10,722][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:24:11,621][__main__][INFO] - Iteration 223 took 23s (40.09% Gen, 56.06% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 7m 46s. Estimated total time: 19h 28m 9s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 56s, 500 more iterations: 3h 14m 41s.
[2025-11-13 09:24:11,622][__main__][INFO] - Starting iteration 223.
[2025-11-13 09:24:11,626][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:24:11,626][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:24:21,168][__main__][INFO] - Number of regex retries in iteration 223: 0
[2025-11-13 09:24:21,169][__main__][INFO] - agents played in iteration 223 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:24:21,621][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:24:21,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:24:21,690][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:24:21,723][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:24:21,724][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:24:21,725][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:24:22,435][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:24:22,730][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:24:23,056][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:24:23,386][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:24:23,711][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:24:24,037][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:24:24,362][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:24:24,689][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:24:25,015][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:24:25,343][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:24:25,669][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:24:25,994][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:24:26,319][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:24:26,646][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:24:26,971][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:24:27,297][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:24:27,623][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:24:27,948][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:24:28,275][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:24:28,602][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:24:28,927][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:24:29,254][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:24:29,580][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:24:29,905][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:24:30,232][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:24:30,558][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:24:30,884][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:24:31,212][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:24:31,539][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:24:31,866][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:24:32,192][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:24:32,519][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:24:32,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:24:33,590][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:24:34,301][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:24:34,303][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:24:34,304][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:24:35,194][__main__][INFO] - Iteration 224 took 23s (40.48% Gen, 55.73% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 17m 43s. Estimated total time: 19h 38m 29s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 16s, 500 more iterations: 3h 16m 24s.
[2025-11-13 09:24:35,196][__main__][INFO] - Starting iteration 224.
[2025-11-13 09:24:35,199][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:24:35,199][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:24:44,587][__main__][INFO] - Number of regex retries in iteration 224: 0
[2025-11-13 09:24:44,588][__main__][INFO] - agents played in iteration 224 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:24:45,042][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:24:45,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:24:45,111][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:24:45,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:24:45,145][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:24:45,146][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:24:45,871][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:24:46,166][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:24:46,494][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:24:46,821][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:24:47,148][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:24:47,474][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:24:47,802][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:24:48,127][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:24:48,456][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:24:48,782][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:24:49,109][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:24:49,435][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:24:49,761][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:24:50,087][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:24:50,413][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:24:50,740][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:24:51,067][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:24:51,393][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:24:51,718][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:24:52,044][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:24:52,370][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:24:52,696][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:24:53,022][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:24:53,348][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:24:53,679][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:24:54,005][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:24:54,331][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:24:54,664][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:24:54,990][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:24:55,318][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:24:55,648][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:24:55,980][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:24:56,306][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:24:57,029][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:24:57,736][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:24:57,738][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:24:57,739][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:24:58,636][__main__][INFO] - Iteration 225 took 23s (40.06% Gen, 56.11% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 10m 45s. Estimated total time: 19h 31m 55s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 3s, 500 more iterations: 3h 15m 19s.
[2025-11-13 09:24:58,638][__main__][INFO] - Starting iteration 225.
[2025-11-13 09:24:58,641][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:24:58,641][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:25:08,261][__main__][INFO] - Number of regex retries in iteration 225: 0 [2025-11-13 09:25:08,262][__main__][INFO] - agents played in iteration 225 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:25:08,721][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:08,757][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:08,790][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:08,824][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:08,824][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:25:08,825][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:25:09,540][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:25:09,839][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:25:10,168][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:25:10,495][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:25:10,821][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:25:11,151][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:25:11,477][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:25:11,805][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:25:12,129][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:25:12,457][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:25:12,787][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:25:13,116][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:25:13,444][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:25:13,772][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:25:14,102][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:25:14,432][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:25:14,761][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:25:15,090][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:25:15,420][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:25:15,748][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:25:16,074][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:25:16,399][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:25:16,730][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:25:17,055][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:25:17,381][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:25:17,709][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:25:18,040][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:25:18,367][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:25:18,694][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:25:19,021][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:25:19,353][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:25:19,680][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:25:20,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:25:20,735][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:25:21,426][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:25:21,428][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:25:21,429][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:25:22,351][__main__][INFO] - Iteration 226 took 23s (40.57% Gen, 55.53% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 23m 59s. Estimated total time: 19h 45m 32s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 31s, 500 more iterations: 3h 17m 35s.
[2025-11-13 09:25:22,353][__main__][INFO] - Starting iteration 226.
[2025-11-13 09:25:22,356][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:25:22,356][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:25:31,917][__main__][INFO] - Number of regex retries in iteration 226: 0
[2025-11-13 09:25:31,918][__main__][INFO] - agents played in iteration 226 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:25:32,372][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:25:32,408][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:25:32,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:25:32,474][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:25:32,475][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:25:32,475][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:25:33,151][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:25:33,447][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:25:33,773][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:25:34,099][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:25:34,426][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:25:34,753][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:25:35,078][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:25:35,407][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:25:35,736][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:25:36,060][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:25:36,386][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:25:36,713][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:25:37,040][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:25:37,367][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:25:37,694][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:25:38,019][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:25:38,344][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:25:38,671][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:25:38,995][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:25:39,322][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:25:39,647][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:25:39,974][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:25:40,299][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:25:40,625][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:25:40,952][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:25:41,277][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:25:41,605][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:25:41,931][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:25:42,258][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:25:42,585][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:25:42,913][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:25:43,239][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:25:43,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:25:44,293][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:25:44,986][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:25:44,988][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:25:44,989][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:25:45,871][__main__][INFO] - Iteration 227 took 23s (40.66% Gen, 55.59% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 13m 50s. Estimated total time: 19h 35m 47s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 11s, 500 more iterations: 3h 15m 57s.
[2025-11-13 09:25:45,873][__main__][INFO] - Starting iteration 227.
[2025-11-13 09:25:45,875][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:25:45,876][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:25:55,611][__main__][INFO] - Number of regex retries in iteration 227: 0
[2025-11-13 09:25:55,612][__main__][INFO] - agents played in iteration 227 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:25:56,064][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:25:56,097][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:25:56,130][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:25:56,163][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:25:56,164][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:25:56,164][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:25:56,847][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:25:57,142][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:25:57,468][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:25:57,793][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:25:58,121][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:25:58,447][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:25:58,772][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:25:59,098][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:25:59,425][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:25:59,751][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:26:00,077][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:26:00,404][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:26:00,731][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:26:01,057][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:26:01,382][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:26:01,709][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:26:02,034][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:26:02,360][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:26:02,686][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:26:03,011][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:26:03,337][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:26:03,662][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:26:03,989][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:26:04,315][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:26:04,641][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:26:04,969][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:26:05,297][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:26:05,627][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:26:05,958][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:26:06,289][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:26:06,617][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:26:06,944][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:26:07,271][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:26:07,998][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:26:08,700][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:26:08,702][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:26:08,703][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:26:09,612][__main__][INFO] - Iteration 228 took 23s (41.01% Gen, 55.15% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 24m 31s. Estimated total time: 19h 46m 52s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 33s, 500 more iterations: 3h 17m 48s.
[2025-11-13 09:26:09,613][__main__][INFO] - Starting iteration 228.
[2025-11-13 09:26:09,616][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:26:09,616][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:26:18,923][__main__][INFO] - Number of regex retries in iteration 228: 0
[2025-11-13 09:26:18,924][__main__][INFO] - agents played in iteration 228 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:26:19,376][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:19,411][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:19,444][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:19,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:19,478][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:26:19,478][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:26:20,151][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:26:20,447][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:26:20,776][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:26:21,103][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:26:21,428][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:26:21,755][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:26:22,084][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:26:22,409][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:26:22,735][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:26:23,062][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:26:23,387][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:26:23,714][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:26:24,040][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:26:24,367][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:26:24,692][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:26:25,018][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:26:25,344][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:26:25,672][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:26:25,996][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:26:26,321][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:26:26,648][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:26:26,975][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:26:27,300][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:26:27,627][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:26:27,954][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:26:28,281][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:26:28,612][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:26:28,938][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:26:29,265][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:26:29,591][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:26:29,918][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:26:30,247][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:26:30,574][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:26:31,300][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:26:32,004][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:26:32,006][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:26:32,007][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:26:32,934][__main__][INFO] - Iteration 229 took 23s (39.91% Gen, 56.11% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 3m 12s. Estimated total time: 19h 25m 55s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 51s, 500 more iterations: 3h 14m 19s.
[2025-11-13 09:26:32,935][__main__][INFO] - Starting iteration 229.
[2025-11-13 09:26:32,938][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:26:32,938][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:26:41,658][__main__][INFO] - Number of regex retries in iteration 229: 0
[2025-11-13 09:26:41,658][__main__][INFO] - agents played in iteration 229 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:26:42,114][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:42,150][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:42,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:42,217][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:42,217][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:26:42,218][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:26:42,940][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:26:43,236][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:26:43,565][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:26:43,893][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:26:44,222][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:26:44,547][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:26:44,873][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:26:45,198][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:26:45,525][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:26:45,851][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:26:46,176][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:26:46,501][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:26:46,826][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:26:47,151][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:26:47,478][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:26:47,804][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:26:48,131][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:26:48,456][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:26:48,783][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:26:49,114][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:26:49,445][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:26:49,772][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:26:50,101][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:26:50,429][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:26:50,756][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:26:51,084][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:26:51,411][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:26:51,740][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:26:52,070][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:26:52,397][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:26:52,724][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:26:53,053][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:26:53,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:26:54,121][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:26:54,823][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:26:54,824][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:26:54,826][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:26:55,829][__main__][INFO] - Iteration 230 took 22s (38.09% Gen, 57.52% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 41m 28s. Estimated total time: 19h 4m 34s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 9s, 500 more iterations: 3h 10m 45s.
[2025-11-13 09:26:55,831][__main__][INFO] - Starting iteration 230.
[2025-11-13 09:26:55,834][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:26:55,835][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:27:05,306][__main__][INFO] - Number of regex retries in iteration 230: 0
[2025-11-13 09:27:05,306][__main__][INFO] - agents played in iteration 230 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:27:05,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:05,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:05,838][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:05,871][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:05,871][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:27:05,872][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:27:06,604][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:27:06,900][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:27:07,226][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:27:07,553][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:27:07,879][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:27:08,204][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:27:08,530][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:27:08,856][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:27:09,183][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:27:09,508][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:27:09,833][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:27:10,160][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:27:10,488][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:27:10,814][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:27:11,140][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:27:11,467][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:27:11,795][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:27:12,122][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:27:12,448][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:27:12,775][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:27:13,100][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:27:13,427][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:27:13,755][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:27:14,082][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:27:14,411][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:27:14,736][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:27:15,062][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:27:15,389][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:27:15,716][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:27:16,046][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:27:16,375][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:27:16,703][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:27:17,029][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:27:17,756][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:27:18,470][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:27:18,472][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:27:18,473][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:27:20,300][__main__][INFO] - Iteration 231 took 24s (38.71% Gen, 53.82% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 59m 47s. Estimated total time: 20h 23m 19s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 46s, 500 more iterations: 3h 23m 53s. [2025-11-13 09:27:20,302][__main__][INFO] - Starting iteration 231. [2025-11-13 09:27:20,305][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. 
[2025-11-13 09:27:20,305][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:27:30,158][__main__][INFO] - Number of regex retries in iteration 231: 0 [2025-11-13 09:27:30,158][__main__][INFO] - agents played in iteration 231 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:27:30,605][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:27:30,640][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:27:30,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:27:30,704][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:27:30,705][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:27:30,705][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:27:31,424][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:27:31,721][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:27:32,047][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:27:32,375][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:27:32,705][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:27:33,033][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:27:33,359][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:27:33,685][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:27:34,013][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:27:34,340][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:27:34,669][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:27:34,997][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:27:35,328][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:27:35,659][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:27:35,986][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:27:36,317][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:27:36,647][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:27:36,971][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:27:37,296][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:27:37,623][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:27:37,950][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:27:38,277][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:27:38,604][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:27:38,933][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:27:39,262][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:27:39,591][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:27:39,917][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:27:40,243][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:27:40,572][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:27:40,899][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:27:41,230][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:27:41,556][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:27:41,885][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:27:42,617][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:27:43,325][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:27:43,326][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:27:43,328][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:27:44,296][__main__][INFO] - Iteration 232 took 23s (41.07% Gen, 54.89% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 35m 40s. Estimated total time: 19h 59m 36s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 59s, 500 more iterations: 3h 19m 56s. [2025-11-13 09:27:44,297][__main__][INFO] - Starting iteration 232. [2025-11-13 09:27:44,300][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. 
[2025-11-13 09:27:44,301][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:27:53,177][__main__][INFO] - Number of regex retries in iteration 232: 0 [2025-11-13 09:27:53,177][__main__][INFO] - agents played in iteration 232 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:27:53,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:27:53,668][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:27:53,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:27:53,734][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:27:53,735][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:27:53,735][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:27:54,451][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:27:54,747][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:27:55,075][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:27:55,402][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:27:55,730][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:27:56,056][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:27:56,383][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:27:56,708][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:27:57,039][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:27:57,369][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:27:57,695][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:27:58,022][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:27:58,348][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:27:58,677][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:27:59,006][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:27:59,331][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:27:59,659][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:27:59,986][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:28:00,313][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:28:00,639][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:28:00,964][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:28:01,289][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:28:01,617][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:28:01,946][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:28:02,273][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:28:02,601][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:28:02,931][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:28:03,258][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:28:03,586][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:28:03,915][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:28:04,243][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:28:04,571][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:28:04,896][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:28:05,616][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:28:06,330][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:28:06,331][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:28:06,333][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:28:07,222][__main__][INFO] - Iteration 233 took 22s (38.72% Gen, 57.39% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 41m 50s. Estimated total time: 19h 6m 8s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 12s, 500 more iterations: 3h 11m 1s. [2025-11-13 09:28:07,224][__main__][INFO] - Starting iteration 233. [2025-11-13 09:28:07,227][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. 
[2025-11-13 09:28:07,227][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:28:16,753][__main__][INFO] - Number of regex retries in iteration 233: 0 [2025-11-13 09:28:16,754][__main__][INFO] - agents played in iteration 233 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:28:17,197][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:28:17,230][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:28:17,263][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:28:17,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:28:17,298][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:28:17,298][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:28:17,996][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:28:18,292][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:28:18,623][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:28:18,951][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:28:19,283][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:28:19,613][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:28:19,945][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:28:20,274][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:28:20,607][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:28:20,935][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:28:21,264][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:28:21,593][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:28:21,920][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:28:22,248][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:28:22,574][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:28:22,900][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:28:23,226][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:28:23,553][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:28:23,881][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:28:24,208][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:28:24,534][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:28:24,862][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:28:25,189][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:28:25,517][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:28:25,843][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:28:26,170][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:28:26,496][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:28:26,823][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:28:27,151][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:28:27,480][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:28:27,807][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:28:28,133][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:28:28,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:28:29,225][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:28:29,947][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:28:29,949][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:28:29,950][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:28:30,831][__main__][INFO] - Iteration 234 took 23s (40.36% Gen, 55.91% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 15m 32s. Estimated total time: 19h 40m 14s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 20s, 500 more iterations: 3h 16m 42s. [2025-11-13 09:28:30,833][__main__][INFO] - Starting iteration 234. [2025-11-13 09:28:30,835][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. 
[2025-11-13 09:28:30,835][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:28:39,752][__main__][INFO] - Number of regex retries in iteration 234: 0 [2025-11-13 09:28:39,752][__main__][INFO] - agents played in iteration 234 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:28:40,223][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:28:40,259][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:28:40,293][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:28:40,327][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:28:40,327][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:28:40,328][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:28:41,035][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:28:41,331][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:28:41,659][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:28:41,986][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:28:42,311][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:28:42,639][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:28:42,967][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:28:43,292][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:28:43,621][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:28:43,950][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:28:44,276][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:28:44,604][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:28:44,931][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:28:45,257][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:28:45,582][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:28:45,908][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:28:46,234][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:28:46,560][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:28:46,887][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:28:47,212][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:28:47,539][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:28:47,867][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:28:48,193][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:28:48,519][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:28:48,851][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:28:49,177][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:28:49,505][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:28:49,833][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:28:50,163][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:28:50,489][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:28:50,817][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:28:51,142][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:28:51,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:28:52,197][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:28:52,892][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:28:52,893][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:28:52,896][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:28:53,787][__main__][INFO] - Iteration 235 took 22s (38.85% Gen, 57.26% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 42m 33s. Estimated total time: 19h 7m 37s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 15s, 500 more iterations: 3h 11m 16s. [2025-11-13 09:28:53,789][__main__][INFO] - Starting iteration 235. [2025-11-13 09:28:53,792][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. 
[2025-11-13 09:28:53,792][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:29:02,838][__main__][INFO] - Number of regex retries in iteration 235: 0 [2025-11-13 09:29:02,838][__main__][INFO] - agents played in iteration 235 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:29:03,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:29:03,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:29:03,357][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:29:03,391][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:29:03,391][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:29:03,391][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:29:04,112][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:29:04,409][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:29:04,735][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:29:05,060][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:29:05,387][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:29:05,712][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:29:06,037][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:29:06,364][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:29:06,689][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:29:07,015][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:29:07,340][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:29:07,669][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:29:07,995][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:29:08,321][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:29:08,648][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:29:08,975][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:29:09,300][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:29:09,629][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:29:09,956][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:29:10,284][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:29:10,610][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:29:10,935][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:29:11,267][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:29:11,599][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:29:11,924][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:29:12,253][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:29:12,581][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:29:12,908][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:29:13,233][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:29:13,559][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:29:13,886][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:29:14,216][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:29:14,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:29:15,267][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:29:15,985][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:29:15,986][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:29:15,988][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:29:16,866][__main__][INFO] - Iteration 236 took 23s (39.20% Gen, 56.99% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 48m 16s. Estimated total time: 19h 13m 44s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 27s, 500 more iterations: 3h 12m 17s. [2025-11-13 09:29:16,867][__main__][INFO] - Starting iteration 236. [2025-11-13 09:29:16,871][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. 
[2025-11-13 09:29:16,872][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:29:25,846][__main__][INFO] - Number of regex retries in iteration 236: 0
[2025-11-13 09:29:25,847][__main__][INFO] - agents played in iteration 236 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:29:26,302][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:26,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:26,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:26,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:26,402][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:29:26,403][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:29:27,127][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:29:27,423][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:29:27,752][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:29:28,079][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:29:28,406][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:29:28,732][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:29:29,059][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:29:29,386][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:29:29,712][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:29:30,038][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:29:30,365][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:29:30,690][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:29:31,017][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:29:31,343][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:29:31,669][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:29:31,995][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:29:32,321][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:29:32,648][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:29:32,974][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:29:33,299][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:29:33,628][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:29:33,956][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:29:34,284][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:29:34,613][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:29:34,943][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:29:35,272][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:29:35,600][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:29:35,926][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:29:36,257][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:29:36,586][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:29:36,915][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:29:37,241][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:29:37,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:29:38,302][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:29:39,025][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:29:39,027][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:29:39,028][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:29:39,906][__main__][INFO] - Iteration 237 took 23s (38.96% Gen, 57.22% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 45m 56s. Estimated total time: 19h 11m 47s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 23s, 500 more iterations: 3h 11m 57s.
[2025-11-13 09:29:39,908][__main__][INFO] - Starting iteration 237.
[2025-11-13 09:29:39,911][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:29:39,912][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:29:49,217][__main__][INFO] - Number of regex retries in iteration 237: 0
[2025-11-13 09:29:49,218][__main__][INFO] - agents played in iteration 237 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:29:49,672][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:49,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:49,739][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:49,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:49,773][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:29:49,773][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:29:50,521][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:29:50,818][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:29:51,147][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:29:51,474][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:29:51,802][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:29:52,129][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:29:52,454][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:29:52,781][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:29:53,107][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:29:53,434][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:29:53,765][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:29:54,091][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:29:54,418][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:29:54,748][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:29:55,075][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:29:55,406][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:29:55,738][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:29:56,068][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:29:56,396][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:29:56,721][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:29:57,050][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:29:57,375][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:29:57,702][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:29:58,028][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:29:58,356][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:29:58,682][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:29:59,009][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:29:59,337][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:29:59,668][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:29:59,994][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:30:00,320][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:30:00,646][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:30:00,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:30:01,689][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:30:02,394][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:30:02,395][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:30:02,398][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:30:03,367][__main__][INFO] - Iteration 238 took 23s (39.67% Gen, 56.19% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 6m 34s. Estimated total time: 19h 32m 49s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 5s, 500 more iterations: 3h 15m 28s.
[2025-11-13 09:30:03,369][__main__][INFO] - Starting iteration 238.
[2025-11-13 09:30:03,373][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:30:03,374][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:30:12,425][__main__][INFO] - Number of regex retries in iteration 238: 0
[2025-11-13 09:30:12,426][__main__][INFO] - agents played in iteration 238 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:30:12,890][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:12,924][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:12,958][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:12,991][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:12,992][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:30:12,992][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:30:13,717][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:30:14,014][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:30:14,340][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:30:14,667][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:30:14,993][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:30:15,319][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:30:15,644][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:30:15,972][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:30:16,297][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:30:16,622][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:30:16,949][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:30:17,274][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:30:17,600][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:30:17,926][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:30:18,258][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:30:18,586][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:30:18,914][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:30:19,241][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:30:19,570][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:30:19,899][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:30:20,225][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:30:20,551][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:30:20,883][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:30:21,209][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:30:21,535][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:30:21,863][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:30:22,190][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:30:22,517][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:30:22,846][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:30:23,174][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:30:23,501][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:30:23,829][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:30:24,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:30:24,877][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:30:25,586][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:30:25,588][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:30:25,589][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:30:26,506][__main__][INFO] - Iteration 239 took 23s (39.13% Gen, 56.90% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 50m 4s. Estimated total time: 19h 16m 41s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 33s, 500 more iterations: 3h 12m 46s.
[2025-11-13 09:30:26,508][__main__][INFO] - Starting iteration 239.
[2025-11-13 09:30:26,511][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:30:26,511][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:30:35,651][__main__][INFO] - Number of regex retries in iteration 239: 0
[2025-11-13 09:30:35,652][__main__][INFO] - agents played in iteration 239 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:30:36,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:36,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:36,173][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:36,207][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:36,208][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:30:36,208][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:30:36,943][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:30:37,240][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:30:37,569][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:30:37,894][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:30:38,220][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:30:38,547][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:30:38,876][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:30:39,201][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:30:39,527][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:30:39,854][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:30:40,179][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:30:40,505][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:30:40,831][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:30:41,157][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:30:41,483][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:30:41,810][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:30:42,136][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:30:42,462][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:30:42,788][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:30:43,116][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:30:43,445][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:30:43,773][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:30:44,099][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:30:44,427][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:30:44,754][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:30:45,081][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:30:45,408][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:30:45,734][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:30:46,061][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:30:46,392][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:30:46,719][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:30:47,051][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:30:47,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:30:48,097][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:30:48,802][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:30:48,804][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:30:48,805][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:30:49,688][__main__][INFO] - Iteration 240 took 23s (39.43% Gen, 56.75% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 51m 53s. Estimated total time: 19h 18m 53s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 37s, 500 more iterations: 3h 13m 8s.
[2025-11-13 09:30:49,690][__main__][INFO] - Starting iteration 240.
[2025-11-13 09:30:49,693][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:30:49,694][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:30:58,100][__main__][INFO] - Number of regex retries in iteration 240: 0
[2025-11-13 09:30:58,101][__main__][INFO] - agents played in iteration 240 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:30:58,556][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:58,591][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:58,625][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:58,658][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:58,660][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:30:58,660][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:30:59,405][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:30:59,702][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:31:00,030][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:31:00,357][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:31:00,683][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:31:01,014][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:31:01,345][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:31:01,673][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:31:01,999][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:31:02,327][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:31:02,657][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:31:02,989][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:31:03,314][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:31:03,640][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:31:03,967][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:31:04,293][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:31:04,619][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:31:04,946][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:31:05,273][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:31:05,600][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:31:05,928][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:31:06,255][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:31:06,582][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:31:06,908][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:31:07,234][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:31:07,560][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:31:07,886][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:31:08,214][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:31:08,540][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:31:08,868][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:31:09,193][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:31:09,520][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:31:09,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:31:10,555][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:31:11,291][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:31:11,292][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:31:11,294][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:31:13,131][__main__][INFO] - Iteration 241 took 23s (35.87% Gen, 56.29% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 4m 31s. Estimated total time: 19h 31m 55s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 3s, 500 more iterations: 3h 15m 19s.
[2025-11-13 09:31:13,132][__main__][INFO] - Starting iteration 241.
[2025-11-13 09:31:13,135][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:31:13,135][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:31:22,163][__main__][INFO] - Number of regex retries in iteration 241: 0
[2025-11-13 09:31:22,164][__main__][INFO] - agents played in iteration 241 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:31:22,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:31:22,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:31:22,689][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:31:22,723][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:31:22,723][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:31:22,724][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:31:23,445][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:31:23,742][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:31:24,068][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:31:24,394][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:31:24,721][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:31:25,049][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:31:25,380][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:31:25,705][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:31:26,031][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:31:26,358][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:31:26,685][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:31:27,013][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:31:27,338][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:31:27,664][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:31:27,991][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:31:28,318][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:31:28,642][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:31:28,969][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:31:29,299][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:31:29,627][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:31:29,955][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:31:30,281][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:31:30,608][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:31:30,934][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:31:31,263][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:31:31,587][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:31:31,914][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:31:32,242][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:31:32,570][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:31:32,896][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:31:33,228][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:31:33,556][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:31:33,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:31:34,591][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:31:35,323][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:31:35,324][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:31:35,325][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:31:36,219][__main__][INFO] - Iteration 242 took 23s (39.11% Gen, 57.02% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 46m 28s. Estimated total time: 19h 14m 15s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 28s, 500 more iterations: 3h 12m 22s.
[2025-11-13 09:31:36,221][__main__][INFO] - Starting iteration 242.
[2025-11-13 09:31:36,224][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:31:36,224][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:31:45,259][__main__][INFO] - Number of regex retries in iteration 242: 0
[2025-11-13 09:31:45,260][__main__][INFO] - agents played in iteration 242 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:31:45,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:31:45,762][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:31:45,795][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:31:45,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:31:45,829][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:31:45,830][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:31:46,586][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:31:46,882][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:31:47,211][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:31:47,536][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:31:47,863][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:31:48,191][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:31:48,519][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:31:48,848][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:31:49,172][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:31:49,501][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:31:49,828][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:31:50,154][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:31:50,480][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:31:50,806][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:31:51,133][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:31:51,460][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:31:51,789][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:31:52,117][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:31:52,444][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:31:52,771][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:31:53,101][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:31:53,428][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:31:53,757][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:31:54,086][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:31:54,413][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:31:54,746][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:31:55,077][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:31:55,406][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:31:55,738][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:31:56,072][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:31:56,406][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:31:56,734][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:31:57,063][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:31:57,765][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:31:58,492][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:31:58,493][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:31:58,495][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:31:59,428][__main__][INFO] - Iteration 243 took 23s (38.93% Gen, 57.03% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 52m 6s. Estimated total time: 19h 20m 16s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 40s, 500 more iterations: 3h 13m 22s.
[2025-11-13 09:31:59,430][__main__][INFO] - Starting iteration 243.
[2025-11-13 09:31:59,434][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:31:59,435][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:32:08,320][__main__][INFO] - Number of regex retries in iteration 243: 0
[2025-11-13 09:32:08,321][__main__][INFO] - agents played in iteration 243 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:32:08,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:32:08,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:32:08,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:32:08,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:32:08,873][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:32:08,873][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:32:09,600][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:32:09,897][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:32:10,224][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:32:10,551][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:32:10,879][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:32:11,205][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:32:11,538][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:32:11,864][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:32:12,191][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:32:12,518][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:32:12,845][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:32:13,171][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:32:13,498][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:32:13,824][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:32:14,151][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:32:14,477][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:32:14,804][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:32:15,132][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:32:15,458][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:32:15,785][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:32:16,111][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:32:16,436][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:32:16,762][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:32:17,088][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:32:17,413][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:32:17,740][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:32:18,067][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:32:18,392][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:32:18,718][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:32:19,046][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:32:19,375][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:32:19,700][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:32:20,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:32:20,748][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:32:21,475][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:32:21,477][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:32:21,478][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:32:22,475][__main__][INFO] - Iteration 244 took 23s (38.57% Gen, 57.10% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 43m 30s. Estimated total time: 19h 12m 3s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 24s, 500 more iterations: 3h 12m 0s.
[2025-11-13 09:32:22,477][__main__][INFO] - Starting iteration 244.
[2025-11-13 09:32:22,479][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:32:22,480][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:32:31,410][__main__][INFO] - Number of regex retries in iteration 244: 0
[2025-11-13 09:32:31,410][__main__][INFO] - agents played in iteration 244 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:32:31,867][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:32:31,901][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:32:31,934][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:32:31,967][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:32:31,968][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:32:31,968][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:32:32,701][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:32:32,999][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:32:33,326][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:32:33,655][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:32:33,986][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:32:34,314][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:32:34,641][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:32:34,968][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:32:35,294][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:32:35,620][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:32:35,945][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:32:36,272][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:32:36,598][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:32:36,924][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:32:37,253][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:32:37,581][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:32:37,907][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:32:38,235][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:32:38,560][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:32:38,886][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:32:39,212][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:32:39,539][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:32:39,866][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:32:40,192][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:32:40,518][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:32:40,843][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:32:41,168][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:32:41,492][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:32:41,818][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:32:42,147][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:32:42,474][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:32:42,799][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:32:43,125][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:32:43,832][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:32:44,560][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:32:44,561][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:32:44,563][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:32:45,542][__main__][INFO] - Iteration 245 took 23s (38.72% Gen, 57.03% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 44m 13s. Estimated total time: 19h 13m 9s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 26s, 500 more iterations: 3h 12m 11s.
[2025-11-13 09:32:45,544][__main__][INFO] - Starting iteration 245.
[2025-11-13 09:32:45,547][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:32:45,547][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:32:55,165][__main__][INFO] - Number of regex retries in iteration 245: 0
[2025-11-13 09:32:55,166][__main__][INFO] - agents played in iteration 245 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:32:55,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:32:55,666][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:32:55,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:32:55,733][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:32:55,733][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:32:55,733][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:32:56,438][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:32:56,735][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:32:57,065][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:32:57,393][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:32:57,720][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:32:58,047][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:32:58,374][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:32:58,701][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:32:59,026][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:32:59,353][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:32:59,680][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:33:00,008][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:33:00,334][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:33:00,661][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:33:00,989][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:33:01,314][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:33:01,640][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:33:01,967][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:33:02,294][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:33:02,619][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:33:02,944][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:33:03,271][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:33:03,596][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:33:03,925][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:33:04,250][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:33:04,576][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:33:04,902][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:33:05,229][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:33:05,555][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:33:05,881][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:33:06,209][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:33:06,536][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:33:06,865][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:33:07,553][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:33:08,281][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:33:08,282][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:33:08,284][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:33:09,175][__main__][INFO] - Iteration 246 took 23s (40.70% Gen, 55.52% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 12m 8s. Estimated total time: 19h 41m 28s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 22s, 500 more iterations: 3h 16m 54s.
[2025-11-13 09:33:09,178][__main__][INFO] - Starting iteration 246.
[2025-11-13 09:33:09,180][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:33:09,181][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:33:18,202][__main__][INFO] - Number of regex retries in iteration 246: 0
[2025-11-13 09:33:18,202][__main__][INFO] - agents played in iteration 246 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:33:18,668][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:33:18,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:33:18,734][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:33:18,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:33:18,768][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:33:18,769][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:33:19,462][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:33:19,761][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:33:20,087][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:33:20,413][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:33:20,739][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:33:21,067][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:33:21,395][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:33:21,722][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:33:22,048][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:33:22,379][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:33:22,708][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:33:23,034][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:33:23,361][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:33:23,689][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:33:24,015][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:33:24,340][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:33:24,665][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:33:24,990][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:33:25,315][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:33:25,641][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:33:25,971][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:33:26,298][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:33:26,624][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:33:26,952][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:33:27,278][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:33:27,603][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:33:27,931][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:33:28,259][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:33:28,585][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:33:28,910][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:33:29,237][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:33:29,563][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:33:29,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:33:30,595][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:33:31,318][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:33:31,319][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:33:31,321][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:33:32,250][__main__][INFO] - Iteration 247 took 23s (39.10% Gen, 56.86% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 43m 49s. Estimated total time: 19h 13m 32s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 27s, 500 more iterations: 3h 12m 15s.
[2025-11-13 09:33:32,252][__main__][INFO] - Starting iteration 247.
[2025-11-13 09:33:32,255][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:33:32,255][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:33:40,937][__main__][INFO] - Number of regex retries in iteration 247: 0 [2025-11-13 09:33:40,937][__main__][INFO] - agents played in iteration 247 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:33:41,389][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:41,422][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:41,455][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:41,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:41,488][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:33:41,489][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:33:42,185][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:33:42,483][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:33:42,810][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:33:43,138][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:33:43,467][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:33:43,792][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:33:44,124][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:33:44,452][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:33:44,782][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:33:45,108][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:33:45,435][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:33:45,760][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:33:46,087][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:33:46,413][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:33:46,739][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:33:47,071][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:33:47,400][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:33:47,731][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:33:48,060][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:33:48,388][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:33:48,713][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:33:49,042][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:33:49,366][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:33:49,695][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:33:50,026][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:33:50,356][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:33:50,687][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:33:51,016][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:33:51,345][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:33:51,676][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:33:52,004][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:33:52,330][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:33:52,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:33:53,351][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:33:54,056][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:33:54,058][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:33:54,060][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:33:54,939][__main__][INFO] - Iteration 248 took 22s (38.27% Gen, 57.84% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 24m 10s. Estimated total time: 18h 54m 16s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 48s, 500 more iterations: 3h 9m 2s.
[2025-11-13 09:33:54,941][__main__][INFO] - Starting iteration 248.
[2025-11-13 09:33:54,944][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:33:54,945][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:34:04,064][__main__][INFO] - Number of regex retries in iteration 248: 0
[2025-11-13 09:34:04,064][__main__][INFO] - agents played in iteration 248 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:34:04,520][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:04,553][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:04,586][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:04,619][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:04,619][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:34:04,620][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:34:05,321][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:34:05,617][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:34:05,944][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:34:06,272][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:34:06,602][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:34:06,929][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:34:07,258][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:34:07,586][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:34:07,912][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:34:08,239][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:34:08,569][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:34:08,896][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:34:09,224][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:34:09,553][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:34:09,885][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:34:10,210][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:34:10,535][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:34:10,861][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:34:11,187][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:34:11,515][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:34:11,839][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:34:12,170][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:34:12,503][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:34:12,829][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:34:13,158][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:34:13,486][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:34:13,815][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:34:14,143][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:34:14,469][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:34:14,795][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:34:15,121][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:34:15,450][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:34:15,775][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:34:16,468][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:34:17,172][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:34:17,173][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:34:17,174][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:34:18,063][__main__][INFO] - Iteration 249 took 23s (39.45% Gen, 56.71% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 45m 29s. Estimated total time: 19h 15m 58s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 31s, 500 more iterations: 3h 12m 39s.
[2025-11-13 09:34:18,065][__main__][INFO] - Starting iteration 249.
[2025-11-13 09:34:18,068][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:34:18,068][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:34:27,246][__main__][INFO] - Number of regex retries in iteration 249: 0
[2025-11-13 09:34:27,246][__main__][INFO] - agents played in iteration 249 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:34:27,690][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:27,726][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:27,758][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:27,791][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:27,792][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:34:27,793][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:34:28,507][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:34:28,950][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:34:29,258][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:34:29,584][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:34:29,913][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:34:30,240][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:34:30,569][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:34:30,896][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:34:31,221][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:34:31,550][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:34:31,877][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:34:32,205][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:34:32,530][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:34:32,856][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:34:33,183][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:34:33,511][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:34:33,837][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:34:34,169][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:34:34,499][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:34:34,826][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:34:35,153][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:34:35,479][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:34:35,805][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:34:36,132][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:34:36,461][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:34:36,787][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:34:37,113][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:34:37,441][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:34:37,769][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:34:38,096][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:34:38,422][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:34:38,749][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:34:39,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:34:39,764][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:34:40,502][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:34:40,503][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:34:40,505][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:34:41,454][__main__][INFO] - Iteration 250 took 23s (39.24% Gen, 56.69% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 58m 30s. Estimated total time: 19h 29m 22s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 58s, 500 more iterations: 3h 14m 53s.
[2025-11-13 09:34:41,456][__main__][INFO] - Starting iteration 250.
[2025-11-13 09:34:41,460][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:34:41,461][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:34:50,862][__main__][INFO] - Number of regex retries in iteration 250: 0
[2025-11-13 09:34:50,862][__main__][INFO] - agents played in iteration 250 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:34:51,320][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:51,353][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:51,387][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:51,420][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:51,421][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:34:51,422][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:34:52,130][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:34:52,425][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:34:52,753][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:34:53,079][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:34:53,406][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:34:53,733][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:34:54,061][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:34:54,388][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:34:54,717][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:34:55,044][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:34:55,373][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:34:55,700][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:34:56,029][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:34:56,360][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:34:56,687][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:34:57,012][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:34:57,338][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:34:57,663][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:34:57,991][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:34:58,317][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:34:58,648][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:34:58,975][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:34:59,302][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:34:59,628][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:34:59,953][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:35:00,280][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:35:00,607][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:35:00,932][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:35:01,260][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:35:01,587][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:35:01,911][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:35:02,237][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:35:02,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:35:03,262][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:35:03,994][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:35:03,995][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:35:03,997][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:35:05,862][__main__][INFO] - Iteration 251 took 24s (38.53% Gen, 53.82% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 48m 50s. Estimated total time: 20h 20m 7s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 40s, 500 more iterations: 3h 23m 21s.
[2025-11-13 09:35:05,864][__main__][INFO] - Starting iteration 251.
[2025-11-13 09:35:05,867][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1.
[2025-11-13 09:35:05,868][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:35:15,203][__main__][INFO] - Number of regex retries in iteration 251: 0
[2025-11-13 09:35:15,204][__main__][INFO] - agents played in iteration 251 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:35:15,647][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:15,682][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:15,715][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:15,748][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:15,749][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:35:15,749][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:35:16,453][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:35:16,748][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:35:17,075][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:35:17,403][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:35:17,732][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:35:18,061][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:35:18,391][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:35:18,719][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:35:19,048][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:35:19,376][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:35:19,709][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:35:20,036][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:35:20,363][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:35:20,693][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:35:21,023][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:35:21,351][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:35:21,678][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:35:22,006][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:35:22,335][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:35:22,662][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:35:22,991][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:35:23,319][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:35:23,649][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:35:23,974][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:35:24,299][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:35:24,630][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:35:24,961][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:35:25,289][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:35:25,618][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:35:25,945][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:35:26,271][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:35:26,597][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:35:26,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:35:27,622][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:35:28,351][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:35:28,352][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:35:28,354][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:35:29,255][__main__][INFO] - Iteration 252 took 23s (39.91% Gen, 56.23% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 57m 46s. Estimated total time: 19h 29m 26s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 58s, 500 more iterations: 3h 14m 54s.
[2025-11-13 09:35:29,257][__main__][INFO] - Starting iteration 252.
[2025-11-13 09:35:29,260][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1.
[2025-11-13 09:35:29,260][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:35:39,029][__main__][INFO] - Number of regex retries in iteration 252: 0
[2025-11-13 09:35:39,030][__main__][INFO] - agents played in iteration 252 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:35:39,493][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:39,528][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:39,561][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:39,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:39,594][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:35:39,595][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:35:40,295][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:35:40,592][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:35:40,920][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:35:41,248][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:35:41,576][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:35:41,907][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:35:42,236][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:35:42,563][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:35:42,891][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:35:43,219][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:35:43,547][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:35:43,877][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:35:44,202][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:35:44,528][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:35:44,854][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:35:45,179][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:35:45,505][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:35:45,830][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:35:46,155][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:35:46,481][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:35:46,806][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:35:47,133][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:35:47,459][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:35:47,786][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:35:48,113][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:35:48,439][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:35:48,765][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:35:49,092][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:35:49,422][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:35:49,750][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:35:50,078][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:35:50,406][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:35:50,734][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:35:51,444][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:35:52,136][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:35:52,137][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:35:52,139][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:35:53,012][__main__][INFO] - Iteration 253 took 23s (41.13% Gen, 55.19% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 15m 35s. Estimated total time: 19h 47m 39s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 35s, 500 more iterations: 3h 17m 56s. [2025-11-13 09:35:53,014][__main__][INFO] - Starting iteration 253. [2025-11-13 09:35:53,016][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 09:35:53,017][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:36:02,515][__main__][INFO] - Number of regex retries in iteration 253: 0 [2025-11-13 09:36:02,516][__main__][INFO] - agents played in iteration 253 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:36:02,962][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:02,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:03,028][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:03,060][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:03,061][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:36:03,061][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:36:03,778][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:36:04,075][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:36:04,401][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:36:04,731][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:36:05,061][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:36:05,387][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:36:05,716][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:36:06,042][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:36:06,371][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:36:06,698][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:36:07,024][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:36:07,350][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:36:07,675][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:36:08,002][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:36:08,327][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:36:08,654][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:36:08,981][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:36:09,307][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:36:09,634][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:36:09,959][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:36:10,285][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:36:10,611][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:36:10,937][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:36:11,267][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:36:11,594][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:36:11,921][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:36:12,249][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:36:12,577][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:36:12,907][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:36:13,233][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:36:13,563][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:36:13,889][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:36:14,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:36:14,931][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:36:15,611][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:36:15,612][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:36:15,614][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:36:16,501][__main__][INFO] - Iteration 254 took 23s (40.44% Gen, 55.77% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 1m 50s. Estimated total time: 19h 34m 17s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 8s, 500 more iterations: 3h 15m 42s. [2025-11-13 09:36:16,503][__main__][INFO] - Starting iteration 254. [2025-11-13 09:36:16,506][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 09:36:16,507][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:36:25,857][__main__][INFO] - Number of regex retries in iteration 254: 0 [2025-11-13 09:36:25,857][__main__][INFO] - agents played in iteration 254 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:36:26,300][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:26,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:26,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:26,401][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:26,401][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:36:26,402][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:36:27,101][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:36:27,398][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:36:27,726][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:36:28,053][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:36:28,383][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:36:28,709][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:36:29,035][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:36:29,361][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:36:29,689][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:36:30,021][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:36:30,351][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:36:30,683][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:36:31,010][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:36:31,338][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:36:31,666][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:36:31,996][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:36:32,324][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:36:32,654][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:36:32,984][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:36:33,311][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:36:33,639][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:36:33,968][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:36:34,295][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:36:34,623][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:36:34,955][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:36:35,284][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:36:35,614][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:36:35,947][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:36:36,274][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:36:36,604][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:36:36,935][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:36:37,264][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:36:37,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:36:38,309][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:36:38,996][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:36:38,998][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:36:38,999][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:36:39,911][__main__][INFO] - Iteration 255 took 23s (39.95% Gen, 56.15% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 57m 25s. Estimated total time: 19h 30m 15s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 0s, 500 more iterations: 3h 15m 2s. [2025-11-13 09:36:39,913][__main__][INFO] - Starting iteration 255. [2025-11-13 09:36:39,916][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 09:36:39,916][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:36:49,216][__main__][INFO] - Number of regex retries in iteration 255: 0 [2025-11-13 09:36:49,216][__main__][INFO] - agents played in iteration 255 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:36:49,658][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:49,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:49,726][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:49,759][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:49,760][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:36:49,760][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:36:50,460][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:36:50,757][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:36:51,085][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:36:51,411][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:36:51,739][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:36:52,068][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:36:52,397][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:36:52,724][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:36:53,052][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:36:53,382][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:36:53,709][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:36:54,035][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:36:54,366][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:36:54,698][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:36:55,029][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:36:55,361][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:36:55,688][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:36:56,017][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:36:56,342][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:36:56,669][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:36:56,994][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:36:57,321][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:36:57,652][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:36:57,978][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:36:58,306][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:36:58,633][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:36:58,960][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:36:59,285][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:36:59,611][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:36:59,939][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:37:00,266][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:37:00,594][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:37:00,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:37:01,624][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:37:02,317][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:37:02,319][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:37:02,320][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:37:03,260][__main__][INFO] - Iteration 256 took 23s (39.84% Gen, 56.13% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 54m 1s. Estimated total time: 19h 27m 15s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 54s, 500 more iterations: 3h 14m 32s. [2025-11-13 09:37:03,262][__main__][INFO] - Starting iteration 256. [2025-11-13 09:37:03,265][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 09:37:03,266][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:37:12,815][__main__][INFO] - Number of regex retries in iteration 256: 0 [2025-11-13 09:37:12,816][__main__][INFO] - agents played in iteration 256 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:37:13,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:13,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:13,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:13,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:13,363][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:37:13,364][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:37:14,083][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:37:14,380][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:37:14,710][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:37:15,044][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:37:15,373][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:37:15,702][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:37:16,035][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:37:16,365][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:37:16,694][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:37:17,022][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:37:17,348][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:37:17,675][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:37:18,000][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:37:18,326][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:37:18,653][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:37:18,981][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:37:19,310][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:37:19,638][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:37:19,964][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:37:20,288][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:37:20,616][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:37:20,945][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:37:21,273][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:37:21,600][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:37:21,927][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:37:22,252][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:37:22,580][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:37:22,906][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:37:23,233][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:37:23,559][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:37:23,885][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:37:24,211][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:37:24,537][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:37:25,246][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:37:25,932][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:37:25,934][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:37:25,935][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:37:26,828][__main__][INFO] - Iteration 257 took 23s (40.53% Gen, 55.67% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 4m 35s. Estimated total time: 19h 38m 13s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 16s, 500 more iterations: 3h 16m 22s. [2025-11-13 09:37:26,830][__main__][INFO] - Starting iteration 257. [2025-11-13 09:37:26,833][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 09:37:26,834][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:37:35,922][__main__][INFO] - Number of regex retries in iteration 257: 0 [2025-11-13 09:37:35,922][__main__][INFO] - agents played in iteration 257 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:37:36,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:36,401][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:36,434][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:36,467][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:36,467][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:37:36,468][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:37:37,169][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:37:37,468][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:37:37,795][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:37:38,121][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:37:38,451][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:37:38,783][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:37:39,112][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:37:39,444][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:37:39,770][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:37:40,097][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:37:40,422][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:37:40,747][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:37:41,075][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:37:41,404][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:37:41,731][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:37:42,057][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:37:42,386][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:37:42,713][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:37:43,040][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:37:43,368][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:37:43,695][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:37:44,022][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:37:44,350][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:37:44,676][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:37:45,003][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:37:45,329][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:37:45,661][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:37:45,986][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:37:46,312][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:37:46,638][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:37:46,964][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:37:47,292][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:37:47,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:37:48,327][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:37:49,013][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:37:49,015][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:37:49,017][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:37:49,927][__main__][INFO] - Iteration 258 took 23s (39.35% Gen, 56.70% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 40m 43s. Estimated total time: 19h 14m 44s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 29s, 500 more iterations: 3h 12m 27s. [2025-11-13 09:37:49,930][__main__][INFO] - Starting iteration 258. [2025-11-13 09:37:49,933][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 09:37:49,933][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:37:58,670][__main__][INFO] - Number of regex retries in iteration 258: 0
[2025-11-13 09:37:58,671][__main__][INFO] - agents played in iteration 258 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:37:59,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:37:59,146][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:37:59,179][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:37:59,212][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:37:59,212][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:37:59,213][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:37:59,922][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:38:00,218][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:38:00,544][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:38:00,869][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:38:01,195][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:38:01,520][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:38:01,847][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:38:02,175][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:38:02,502][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:38:02,833][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:38:03,161][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:38:03,489][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:38:03,816][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:38:04,145][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:38:04,470][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:38:04,796][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:38:05,124][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:38:05,448][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:38:05,775][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:38:06,101][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:38:06,428][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:38:06,755][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:38:07,082][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:38:07,407][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:38:07,735][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:38:08,062][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:38:08,389][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:38:08,719][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:38:09,047][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:38:09,376][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:38:09,702][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:38:10,028][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:38:10,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:38:11,058][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:38:11,750][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:38:11,751][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:38:11,753][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:38:12,648][__main__][INFO] - Iteration 259 took 22s (38.46% Gen, 57.59% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 21m 25s. Estimated total time: 18h 55m 49s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 51s, 500 more iterations: 3h 9m 18s.
[2025-11-13 09:38:12,651][__main__][INFO] - Starting iteration 259.
[2025-11-13 09:38:12,654][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1.
[2025-11-13 09:38:12,654][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:38:21,635][__main__][INFO] - Number of regex retries in iteration 259: 0
[2025-11-13 09:38:21,636][__main__][INFO] - agents played in iteration 259 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:38:22,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:22,117][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:22,150][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:22,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:22,184][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:38:22,184][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:38:22,882][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:38:23,180][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:38:23,508][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:38:23,835][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:38:24,163][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:38:24,492][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:38:24,819][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:38:25,148][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:38:25,480][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:38:25,813][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:38:26,145][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:38:26,476][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:38:26,806][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:38:27,133][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:38:27,458][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:38:27,784][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:38:28,110][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:38:28,442][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:38:28,768][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:38:29,095][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:38:29,424][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:38:29,751][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:38:30,078][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:38:30,407][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:38:30,735][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:38:31,064][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:38:31,396][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:38:31,724][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:38:32,052][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:38:32,378][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:38:32,706][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:38:33,031][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:38:33,358][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:38:34,062][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:38:34,771][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:38:34,772][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:38:34,774][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:38:35,678][__main__][INFO] - Iteration 260 took 23s (39.01% Gen, 57.06% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 36m 29s. Estimated total time: 19h 11m 15s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 22s, 500 more iterations: 3h 11m 52s.
[2025-11-13 09:38:35,680][__main__][INFO] - Starting iteration 260.
[2025-11-13 09:38:35,683][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1.
[2025-11-13 09:38:35,683][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:38:44,274][__main__][INFO] - Number of regex retries in iteration 260: 0
[2025-11-13 09:38:44,275][__main__][INFO] - agents played in iteration 260 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:38:44,724][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:44,757][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:44,790][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:44,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:44,823][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:38:44,824][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:38:45,541][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:38:45,838][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:38:46,166][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:38:46,495][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:38:46,822][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:38:47,152][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:38:47,479][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:38:47,804][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:38:48,133][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:38:48,461][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:38:48,793][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:38:49,123][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:38:49,453][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:38:49,786][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:38:50,117][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:38:50,445][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:38:50,776][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:38:51,103][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:38:51,431][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:38:51,759][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:38:52,086][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:38:52,412][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:38:52,738][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:38:53,065][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:38:53,392][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:38:53,719][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:38:54,047][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:38:54,374][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:38:54,704][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:38:55,029][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:38:55,355][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:38:55,681][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:38:56,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:38:56,717][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:38:57,393][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:38:57,394][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:38:57,395][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:38:59,166][__main__][INFO] - Iteration 261 took 23s (36.58% Gen, 55.87% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 59m 2s. Estimated total time: 19h 34m 12s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 8s, 500 more iterations: 3h 15m 42s.
[2025-11-13 09:38:59,168][__main__][INFO] - Starting iteration 261.
[2025-11-13 09:38:59,171][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1.
[2025-11-13 09:38:59,171][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:39:07,907][__main__][INFO] - Number of regex retries in iteration 261: 0
[2025-11-13 09:39:07,907][__main__][INFO] - agents played in iteration 261 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:39:08,359][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:08,392][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:08,426][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:08,460][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:08,461][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:39:08,461][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:39:09,182][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:39:09,480][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:39:09,806][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:39:10,132][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:39:10,465][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:39:10,793][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:39:11,119][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:39:11,446][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:39:11,779][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:39:12,109][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:39:12,435][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:39:12,763][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:39:13,093][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:39:13,426][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:39:13,757][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:39:14,083][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:39:14,410][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:39:14,736][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:39:15,064][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:39:15,390][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:39:15,715][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:39:16,042][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:39:16,370][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:39:16,696][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:39:17,023][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:39:17,354][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:39:17,682][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:39:18,009][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:39:18,334][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:39:18,659][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:39:18,986][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:39:19,313][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:39:19,643][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:39:20,352][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:39:21,036][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:39:21,037][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:39:21,039][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:39:21,945][__main__][INFO] - Iteration 262 took 22s (38.36% Gen, 57.66% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 23m 11s. Estimated total time: 18h 58m 43s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 57s, 500 more iterations: 3h 9m 47s.
[2025-11-13 09:39:21,947][__main__][INFO] - Starting iteration 262.
[2025-11-13 09:39:21,949][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1.
[2025-11-13 09:39:21,950][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:39:31,012][__main__][INFO] - Number of regex retries in iteration 262: 0
[2025-11-13 09:39:31,012][__main__][INFO] - agents played in iteration 262 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:39:31,461][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:31,493][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:31,526][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:31,559][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:31,560][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:39:31,560][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:39:32,287][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:39:32,584][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:39:32,910][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:39:33,237][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:39:33,564][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:39:33,890][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:39:34,218][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:39:34,545][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:39:34,871][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:39:35,197][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:39:35,523][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:39:35,848][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:39:36,176][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:39:36,505][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:39:36,832][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:39:37,158][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:39:37,484][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:39:37,809][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:39:38,135][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:39:38,461][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:39:38,788][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:39:39,114][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:39:39,440][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:39:39,768][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:39:40,094][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:39:40,422][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:39:40,749][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:39:41,074][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:39:41,401][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:39:41,726][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:39:42,052][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:39:42,378][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:39:42,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:39:43,414][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:39:44,097][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:39:44,099][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:39:44,100][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:39:45,014][__main__][INFO] - Iteration 263 took 23s (39.29% Gen, 56.74% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 37m 20s. Estimated total time: 19h 13m 16s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 26s, 500 more iterations: 3h 12m 12s.
[2025-11-13 09:39:45,016][__main__][INFO] - Starting iteration 263.
[2025-11-13 09:39:45,019][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1.
[2025-11-13 09:39:45,019][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:39:50,242][mllm.models.large_language_model_local][WARNING] - Response bụ did not match regex: (|), retry 1/1
[2025-11-13 09:39:53,944][__main__][INFO] - Number of regex retries in iteration 263: 1
[2025-11-13 09:39:53,945][__main__][INFO] - agents played in iteration 263 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:39:54,416][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:54,449][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:54,482][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:54,516][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:54,516][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:39:54,516][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:39:55,236][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:39:55,531][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:39:55,859][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:39:56,185][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:39:56,513][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:39:56,839][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:39:57,168][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:39:57,499][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:39:57,827][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:39:58,159][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:39:58,488][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:39:58,821][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:39:59,148][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:39:59,476][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:39:59,804][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:40:00,134][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:40:00,464][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:40:00,792][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:40:01,118][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:40:01,445][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:40:01,771][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:40:02,099][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:40:02,426][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:40:02,754][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:40:03,080][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:40:03,407][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:40:03,734][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:40:04,060][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:40:04,385][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:40:04,712][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:40:05,038][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:40:05,365][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:40:05,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:40:06,395][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:40:07,081][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:40:07,082][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:40:07,084][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:40:08,029][__main__][INFO] - Iteration 264 took 23s (38.79% Gen, 57.10% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 34m 14s. Estimated total time: 19h 10m 33s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 21s, 500 more iterations: 3h 11m 45s. [2025-11-13 09:40:08,031][__main__][INFO] - Starting iteration 264. [2025-11-13 09:40:08,034][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 09:40:08,035][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:40:16,981][__main__][INFO] - Number of regex retries in iteration 264: 0 [2025-11-13 09:40:16,982][__main__][INFO] - agents played in iteration 264 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:40:17,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:17,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:17,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:17,554][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:17,555][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:40:17,556][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:40:18,278][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:40:18,575][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:40:18,906][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:40:19,233][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:40:19,564][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:40:19,894][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:40:20,221][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:40:20,553][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:40:20,886][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:40:21,213][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:40:21,540][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:40:21,867][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:40:22,193][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:40:22,519][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:40:22,846][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:40:23,177][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:40:23,507][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:40:23,833][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:40:24,162][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:40:24,488][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 09:40:24,815][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:40:25,143][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:40:25,472][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:40:25,800][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:40:26,127][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:40:26,454][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:40:26,780][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:40:27,107][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:40:27,433][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:40:27,760][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:40:28,086][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:40:28,413][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:40:28,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:40:29,439][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:40:30,139][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:40:30,141][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:40:30,144][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:40:31,056][__main__][INFO] - Iteration 265 took 23s (38.86% Gen, 57.17% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 34m 26s. Estimated total time: 19h 11m 7s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 22s, 500 more iterations: 3h 11m 51s. [2025-11-13 09:40:31,058][__main__][INFO] - Starting iteration 265. [2025-11-13 09:40:31,062][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 09:40:31,063][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:40:40,166][__main__][INFO] - Number of regex retries in iteration 265: 0 [2025-11-13 09:40:40,167][__main__][INFO] - agents played in iteration 265 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:40:40,675][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:40,708][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:40,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:40,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:40,776][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:40:40,777][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:40:41,498][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:40:41,797][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:40:42,124][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:40:42,452][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:40:42,779][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:40:43,105][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:40:43,431][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:40:43,758][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:40:44,086][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:40:44,414][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:40:44,744][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:40:45,074][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:40:45,401][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:40:45,728][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:40:46,055][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:40:46,383][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:40:46,710][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:40:47,038][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:40:47,367][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:40:47,692][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 09:40:48,024][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:40:48,352][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:40:48,678][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:40:49,006][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:40:49,334][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:40:49,661][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:40:49,987][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:40:50,314][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:40:50,641][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:40:50,970][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:40:51,297][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:40:51,623][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:40:51,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:40:52,654][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:40:53,344][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:40:53,346][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:40:53,347][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:40:54,272][__main__][INFO] - Iteration 266 took 23s (39.22% Gen, 56.79% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 43m 26s. Estimated total time: 19h 20m 31s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 41s, 500 more iterations: 3h 13m 25s. [2025-11-13 09:40:54,274][__main__][INFO] - Starting iteration 266. [2025-11-13 09:40:54,277][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 09:40:54,278][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:41:02,755][__main__][INFO] - Number of regex retries in iteration 266: 0 [2025-11-13 09:41:02,756][__main__][INFO] - agents played in iteration 266 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:41:03,220][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:03,253][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:03,286][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:03,319][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:03,320][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:41:03,320][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:41:04,040][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:41:04,338][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:41:04,666][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:41:04,992][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:41:05,319][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:41:05,645][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:41:05,972][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:41:06,303][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:41:06,634][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:41:06,961][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:41:07,290][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:41:07,616][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:41:07,944][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:41:08,273][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:41:08,600][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:41:08,926][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:41:09,254][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:41:09,584][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:41:09,913][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:41:10,244][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 09:41:10,570][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:41:10,897][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:41:11,223][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:41:11,550][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:41:11,878][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:41:12,207][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:41:12,537][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:41:12,862][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:41:13,189][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:41:13,517][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:41:13,846][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:41:14,172][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:41:14,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:41:15,220][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:41:15,922][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:41:15,924][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:41:15,925][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:41:16,879][__main__][INFO] - Iteration 267 took 22s (37.51% Gen, 58.27% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 12m 41s. Estimated total time: 18h 50m 8s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 40s, 500 more iterations: 3h 8m 21s. [2025-11-13 09:41:16,882][__main__][INFO] - Starting iteration 267. [2025-11-13 09:41:16,885][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 09:41:16,885][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:41:25,781][__main__][INFO] - Number of regex retries in iteration 267: 0 [2025-11-13 09:41:25,782][__main__][INFO] - agents played in iteration 267 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:41:26,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:26,265][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:26,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:26,332][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:26,332][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:41:26,333][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:41:27,061][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:41:27,358][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:41:27,684][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:41:28,011][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:41:28,337][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:41:28,663][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:41:28,989][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:41:29,315][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:41:29,641][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:41:29,968][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:41:30,294][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:41:30,620][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:41:30,950][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:41:31,275][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:41:31,602][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:41:31,928][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:41:32,255][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:41:32,585][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:41:32,913][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:41:33,244][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 09:41:33,569][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:41:33,898][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:41:34,227][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:41:34,556][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:41:34,882][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:41:35,208][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:41:35,534][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:41:35,861][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:41:36,187][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:41:36,514][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:41:36,842][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:41:37,167][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:41:37,494][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:41:38,209][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:41:38,907][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:41:38,909][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:41:38,910][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:41:39,978][__main__][INFO] - Iteration 268 took 23s (38.52% Gen, 56.85% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 36m 52s. Estimated total time: 19h 14m 42s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 29s, 500 more iterations: 3h 12m 27s. [2025-11-13 09:41:39,980][__main__][INFO] - Starting iteration 268. [2025-11-13 09:41:39,983][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 09:41:39,984][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:41:48,880][__main__][INFO] - Number of regex retries in iteration 268: 0 [2025-11-13 09:41:48,880][__main__][INFO] - agents played in iteration 268 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:41:49,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:49,363][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:49,397][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:49,430][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:49,431][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:41:49,431][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:41:50,158][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:41:50,453][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:41:50,780][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:41:51,107][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:41:51,433][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:41:51,763][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:41:52,094][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:41:52,421][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:41:52,748][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:41:53,074][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:41:53,402][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:41:53,730][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:41:54,055][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:41:54,383][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:41:54,715][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:41:55,042][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:41:55,373][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:41:55,706][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:41:56,038][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:41:56,369][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 09:41:56,694][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:41:57,026][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:41:57,355][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:41:57,683][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:41:58,008][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:41:58,335][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:41:58,661][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:41:58,989][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:41:59,316][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:41:59,642][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:41:59,968][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:42:00,295][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:42:00,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:42:01,344][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:42:02,022][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:42:02,024][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:42:02,025][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:42:02,991][__main__][INFO] - Iteration 269 took 23s (38.67% Gen, 57.13% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 32m 10s. Estimated total time: 19h 10m 24s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 20s, 500 more iterations: 3h 11m 44s. [2025-11-13 09:42:02,993][__main__][INFO] - Starting iteration 269. [2025-11-13 09:42:02,996][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 09:42:02,996][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:42:11,666][__main__][INFO] - Number of regex retries in iteration 269: 0 [2025-11-13 09:42:11,667][__main__][INFO] - agents played in iteration 269 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:42:12,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:12,149][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:12,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:12,216][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:12,217][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:42:12,217][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:42:12,941][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:42:13,237][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:42:13,566][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:42:13,892][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:42:14,219][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:42:14,551][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:42:14,879][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:42:15,206][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:42:15,532][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:42:15,858][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:42:16,185][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:42:16,511][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:42:16,836][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:42:17,164][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:42:17,490][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:42:17,817][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:42:18,143][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:42:18,470][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:42:18,797][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:42:19,125][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:42:19,452][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:42:19,779][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:42:20,105][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:42:20,435][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:42:20,761][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:42:21,088][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:42:21,415][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:42:21,740][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:42:22,069][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:42:22,395][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:42:22,726][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:42:23,056][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:42:23,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:42:24,091][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:42:24,778][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:42:24,780][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:42:24,781][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:42:25,682][__main__][INFO] - Iteration 270 took 22s (38.22% Gen, 57.81% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 15m 43s. Estimated total time: 18h 54m 20s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 48s, 500 more iterations: 3h 9m 3s. [2025-11-13 09:42:25,684][__main__][INFO] - Starting iteration 270. [2025-11-13 09:42:25,687][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 09:42:25,687][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:42:33,918][__main__][INFO] - Number of regex retries in iteration 270: 0 [2025-11-13 09:42:33,918][__main__][INFO] - agents played in iteration 270 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:42:34,372][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:34,406][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:34,439][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:34,473][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:34,474][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:42:34,474][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:42:35,204][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:42:35,503][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:42:35,830][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:42:36,156][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:42:36,481][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:42:36,807][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:42:37,135][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:42:37,461][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:42:37,788][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:42:38,115][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:42:38,441][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:42:38,767][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:42:39,093][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:42:39,419][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:42:39,746][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:42:40,072][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:42:40,399][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:42:40,725][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:42:41,054][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:42:41,379][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:42:41,706][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:42:42,033][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:42:42,368][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:42:42,695][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:42:43,023][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:42:43,356][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:42:43,685][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:42:44,016][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:42:44,345][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:42:44,674][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:42:45,003][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:42:45,328][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:42:45,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:42:46,354][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:42:47,053][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:42:47,054][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:42:47,056][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:42:48,789][__main__][INFO] - Iteration 271 took 23s (35.62% Gen, 56.87% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 36m 9s. Estimated total time: 19h 15m 9s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 30s, 500 more iterations: 3h 12m 31s. [2025-11-13 09:42:48,791][__main__][INFO] - Starting iteration 271. [2025-11-13 09:42:48,794][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. 
[2025-11-13 09:42:48,794][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:42:57,413][__main__][INFO] - Number of regex retries in iteration 271: 0 [2025-11-13 09:42:57,414][__main__][INFO] - agents played in iteration 271 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:42:57,868][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:57,901][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:57,934][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:57,968][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:57,968][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:42:57,968][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:42:58,693][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:42:58,991][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:42:59,318][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:42:59,643][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:42:59,970][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:43:00,298][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:43:00,625][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:43:00,951][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:43:01,279][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:43:01,605][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:43:01,932][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:43:02,258][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:43:02,584][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:43:02,913][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:43:03,242][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:43:03,570][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:43:03,897][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:43:04,226][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:43:04,554][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:43:04,880][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:43:05,206][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:43:05,533][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:43:05,859][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:43:06,188][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:43:06,518][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:43:06,846][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:43:07,179][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:43:07,506][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:43:07,832][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:43:08,159][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:43:08,485][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:43:08,813][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:43:09,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:43:09,851][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:43:10,535][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:43:10,537][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:43:10,538][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:43:11,665][__main__][INFO] - Iteration 272 took 22s (37.68% Gen, 57.38% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 24m 15s. Estimated total time: 19h 3m 37s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 7s, 500 more iterations: 3h 10m 36s. [2025-11-13 09:43:11,667][__main__][INFO] - Starting iteration 272. [2025-11-13 09:43:11,670][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. 
[2025-11-13 09:43:11,671][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:43:20,445][__main__][INFO] - Number of regex retries in iteration 272: 0 [2025-11-13 09:43:20,446][__main__][INFO] - agents played in iteration 272 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:43:20,901][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:43:20,935][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:43:20,968][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:43:21,002][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:43:21,002][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:43:21,003][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:43:21,719][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:43:22,015][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:43:22,342][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:43:22,667][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:43:22,993][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:43:23,319][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:43:23,645][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:43:23,972][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:43:24,298][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:43:24,624][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:43:24,949][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:43:25,275][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:43:25,601][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:43:25,927][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:43:26,254][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:43:26,581][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:43:26,906][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:43:27,231][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:43:27,558][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:43:27,885][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:43:28,212][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:43:28,538][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:43:28,865][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:43:29,191][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:43:29,520][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:43:29,852][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:43:30,185][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:43:30,513][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:43:30,839][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:43:31,166][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:43:31,494][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:43:31,819][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:43:32,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:43:32,851][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:43:33,531][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:43:33,533][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:43:33,535][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:43:34,441][__main__][INFO] - Iteration 273 took 22s (38.53% Gen, 57.48% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 18m 49s. Estimated total time: 18h 58m 34s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 57s, 500 more iterations: 3h 9m 45s. [2025-11-13 09:43:34,443][__main__][INFO] - Starting iteration 273. [2025-11-13 09:43:34,447][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. 
[2025-11-13 09:43:34,447][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:43:43,616][__main__][INFO] - Number of regex retries in iteration 273: 0 [2025-11-13 09:43:43,617][__main__][INFO] - agents played in iteration 273 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:43:44,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:43:44,105][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:43:44,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:43:44,172][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:43:44,172][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:43:44,172][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:43:44,901][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:43:45,197][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:43:45,524][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:43:45,849][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:43:46,175][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:43:46,502][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:43:46,828][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:43:47,155][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:43:47,481][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:43:47,808][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:43:48,134][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:43:48,460][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:43:48,787][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:43:49,114][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:43:49,439][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:43:49,765][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:43:50,092][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:43:50,418][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:43:50,749][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:43:51,078][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:43:51,410][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:43:51,741][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:43:52,067][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:43:52,394][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:43:52,722][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:43:53,050][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:43:53,375][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:43:53,702][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:43:54,028][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:43:54,354][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:43:54,682][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:43:55,008][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:43:55,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:43:56,039][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:43:56,733][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:43:56,734][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:43:56,736][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:43:57,620][__main__][INFO] - Iteration 274 took 23s (39.57% Gen, 56.61% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 38m 36s. Estimated total time: 19h 18m 44s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 37s, 500 more iterations: 3h 13m 7s. [2025-11-13 09:43:57,623][__main__][INFO] - Starting iteration 274. [2025-11-13 09:43:57,627][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. 
[2025-11-13 09:43:57,627][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:44:06,193][__main__][INFO] - Number of regex retries in iteration 274: 0 [2025-11-13 09:44:06,193][__main__][INFO] - agents played in iteration 274 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:44:06,646][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:06,679][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:06,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:06,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:06,747][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:44:06,747][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:44:07,487][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:44:07,783][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:44:08,109][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:44:08,435][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:44:08,762][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:44:09,087][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:44:09,412][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:44:09,739][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:44:10,065][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:44:10,393][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:44:10,720][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:44:11,046][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:44:11,372][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:44:11,699][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:44:12,027][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:44:12,356][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:44:12,684][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:44:13,012][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:44:13,339][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:44:13,668][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:44:13,996][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:44:14,325][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:44:14,653][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:44:14,980][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:44:15,310][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:44:15,642][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:44:15,972][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:44:16,298][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:44:16,626][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:44:16,956][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:44:17,281][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:44:17,608][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:44:17,934][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:44:18,638][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:44:19,317][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:44:19,319][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:44:19,320][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:44:20,210][__main__][INFO] - Iteration 275 took 22s (37.93% Gen, 58.13% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 8m 41s. Estimated total time: 18h 49m 12s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 38s, 500 more iterations: 3h 8m 12s. [2025-11-13 09:44:20,212][__main__][INFO] - Starting iteration 275. [2025-11-13 09:44:20,215][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. 
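The per-iteration summary lines above follow a fixed format, and the ETA figures are simple multiples of the smoothed per-iteration time (e.g. "10 more iterations: 3m 45s" is ~22.5 s/iteration × 10 = 225 s). A minimal stand-alone parser for these lines — purely illustrative, not part of the `mllm` codebase — could look like this:

```python
import re

# Hypothetical helper (not part of the training code) that parses the
# per-iteration summary lines emitted by __main__ and checks the ETA
# arithmetic: "10 more iterations: 3m 45s" is just the smoothed
# per-iteration time (~22.5 s here) multiplied by 10.
SUMMARY = re.compile(
    r"Iteration (\d+) took (\d+)s \(([\d.]+)% Gen, ([\d.]+)% Train\)"
)

def parse_summary(line):
    """Extract iteration number, wall time, and phase split from a summary line."""
    m = SUMMARY.search(line)
    if m is None:
        return None
    it, secs, gen_pct, train_pct = m.groups()
    return {
        "iteration": int(it),
        "seconds": int(secs),
        "gen_pct": float(gen_pct),
        "train_pct": float(train_pct),
    }

line = ("Iteration 275 took 22s (37.93% Gen, 58.13% Train). "
        "Generation: 8s, Training: 13s.")
info = parse_summary(line)
# 10-iteration estimate at 22.5 s/iteration: 225 s, i.e. the log's "3m 45s"
eta_10_s = 22.5 * 10
```

Collecting these dicts across the run gives a quick way to spot drift in the Gen/Train split or in per-iteration wall time without re-reading the raw log.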
[2025-11-13 09:44:20,216][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:44:29,522][__main__][INFO] - Number of regex retries in iteration 275: 0 [2025-11-13 09:44:29,522][__main__][INFO] - agents played in iteration 275 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:44:29,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:30,014][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:30,048][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:30,081][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:30,082][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:44:30,082][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:44:30,815][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:44:31,111][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:44:31,439][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:44:31,766][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:44:32,093][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:44:32,419][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:44:32,745][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:44:33,072][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:44:33,400][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:44:33,725][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:44:34,052][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:44:34,378][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:44:34,705][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:44:35,032][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:44:35,357][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:44:35,683][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:44:36,009][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:44:36,336][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:44:36,667][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:44:36,996][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:44:37,323][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:44:37,651][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:44:37,979][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:44:38,310][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:44:38,642][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:44:38,968][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:44:39,294][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:44:39,620][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:44:39,948][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:44:40,275][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:44:40,602][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:44:40,928][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:44:41,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:44:41,963][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:44:42,653][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:44:42,654][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:44:42,656][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:44:43,559][__main__][INFO] - Iteration 276 took 23s (39.87% Gen, 56.26% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 46m 20s. Estimated total time: 19h 27m 15s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 54s, 500 more iterations: 3h 14m 32s. [2025-11-13 09:44:43,561][__main__][INFO] - Starting iteration 276. [2025-11-13 09:44:43,564][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. 
[2025-11-13 09:44:43,565][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:44:52,545][__main__][INFO] - Number of regex retries in iteration 276: 0 [2025-11-13 09:44:52,545][__main__][INFO] - agents played in iteration 276 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:44:53,003][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:53,037][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:53,070][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:53,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:53,104][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:44:53,105][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:44:53,834][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:44:54,131][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:44:54,458][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:44:54,785][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:44:55,111][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:44:55,438][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:44:55,765][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:44:56,091][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:44:56,418][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:44:56,744][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:44:57,071][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:44:57,398][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:44:57,725][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:44:58,052][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:44:58,380][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:44:58,706][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:44:59,033][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:44:59,362][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:44:59,694][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:45:00,023][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:45:00,350][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:45:00,677][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:45:01,007][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:45:01,339][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:45:01,668][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:45:01,996][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:45:02,322][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:45:02,650][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:45:02,979][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:45:03,309][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:45:03,635][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:45:03,961][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:45:04,289][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:45:04,997][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:45:05,694][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:45:05,696][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:45:05,697][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:45:06,695][__main__][INFO] - Iteration 277 took 23s (38.82% Gen, 56.86% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 35m 16s. Estimated total time: 19h 16m 34s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 33s, 500 more iterations: 3h 12m 45s. [2025-11-13 09:45:06,697][__main__][INFO] - Starting iteration 277. [2025-11-13 09:45:06,699][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. 
[2025-11-13 09:45:06,700][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:45:15,560][__main__][INFO] - Number of regex retries in iteration 277: 0 [2025-11-13 09:45:15,561][__main__][INFO] - agents played in iteration 277 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:45:16,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:16,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:16,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:16,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:16,112][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:45:16,113][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:45:16,852][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:45:17,151][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:45:17,477][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:45:17,805][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:45:18,130][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:45:18,456][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:45:18,782][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:45:19,110][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:45:19,435][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:45:19,761][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:45:20,089][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:45:20,415][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:45:20,743][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:45:21,069][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:45:21,395][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:45:21,721][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:45:22,049][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:45:22,377][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:45:22,703][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:45:23,032][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:45:23,358][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:45:23,687][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:45:24,018][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:45:24,348][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:45:24,674][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:45:24,999][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:45:25,328][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:45:25,654][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:45:25,981][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:45:26,309][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:45:26,635][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:45:26,966][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:45:27,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:45:28,008][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:45:28,701][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:45:28,703][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:45:28,705][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:45:29,646][__main__][INFO] - Iteration 278 took 22s (38.61% Gen, 57.28% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 25m 41s. Estimated total time: 19h 7m 22s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 14s, 500 more iterations: 3h 11m 13s. [2025-11-13 09:45:29,648][__main__][INFO] - Starting iteration 278. [2025-11-13 09:45:29,652][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. 
[2025-11-13 09:45:29,652][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:45:38,344][__main__][INFO] - Number of regex retries in iteration 278: 0 [2025-11-13 09:45:38,345][__main__][INFO] - agents played in iteration 278 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:45:38,807][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:38,841][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:38,874][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:38,907][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:38,908][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:45:38,908][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:45:39,632][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:45:39,930][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:45:40,256][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:45:40,582][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:45:40,908][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:45:41,234][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:45:41,559][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:45:41,886][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:45:42,212][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:45:42,538][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:45:42,865][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:45:43,192][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:45:43,518][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:45:43,846][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:45:44,173][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:45:44,500][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:45:44,828][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:45:45,154][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:45:45,487][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:45:45,814][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:45:46,142][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:45:46,469][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:45:46,796][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:45:47,123][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:45:47,449][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:45:47,775][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:45:48,101][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:45:48,427][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:45:48,755][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:45:49,081][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:45:49,408][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:45:49,734][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:45:50,062][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:45:50,789][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:45:51,483][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:45:51,485][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:45:51,486][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:45:52,411][__main__][INFO] - Iteration 279 took 22s (38.19% Gen, 57.74% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 15m 58s. Estimated total time: 18h 58m 2s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 56s, 500 more iterations: 3h 9m 40s. [2025-11-13 09:45:52,413][__main__][INFO] - Starting iteration 279. [2025-11-13 09:45:52,416][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. 
[2025-11-13 09:45:52,417][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:46:00,998][__main__][INFO] - Number of regex retries in iteration 279: 0 [2025-11-13 09:46:00,999][__main__][INFO] - agents played in iteration 279 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:46:01,447][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:46:01,480][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:46:01,514][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:46:01,547][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:46:01,548][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:46:01,548][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:46:02,274][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:46:02,573][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:46:02,900][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:46:03,226][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:46:03,554][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:46:03,884][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:46:04,211][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:46:04,538][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:46:04,864][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:46:05,189][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:46:05,515][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:46:05,841][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:46:06,168][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:46:06,495][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:46:06,822][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:46:07,149][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:46:07,476][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:46:07,803][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:46:08,130][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:46:08,457][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:46:08,783][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:46:09,110][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:46:09,438][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:46:09,765][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:46:10,093][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:46:10,425][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:46:10,756][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:46:11,082][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:46:11,409][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:46:11,735][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:46:12,061][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:46:12,388][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:46:12,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:46:13,448][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:46:14,140][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:46:14,142][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:46:14,144][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:46:15,062][__main__][INFO] - Iteration 280 took 22s (37.89% Gen, 58.04% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 9m 54s. Estimated total time: 18h 52m 20s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 44s, 500 more iterations: 3h 8m 43s. [2025-11-13 09:46:15,065][__main__][INFO] - Starting iteration 280. [2025-11-13 09:46:15,068][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. 
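Each training phase in this log follows the same shape: 128 mini-batches are processed with the policy-gradient loss accumulated across all of them ("Accumulated the policy gradient loss for 3840 tokens"), followed by a single optimizer update ("Apply reinforce step"). A toy sketch of that accumulation pattern — with illustrative names and a 1-parameter analytic gradient, not the actual `mllm` trainer API — is:

```python
# Toy sketch of gradient accumulation as reflected in the log: accumulate
# the REINFORCE gradient over all mini-batches, count contributing tokens,
# then apply one parameter update at the end. All names are hypothetical.
def accumulate_and_step(weight, mini_batches, lr=0.01, log_every=4):
    """Each mini-batch is a list of (feature, advantage) pairs for a
    1-parameter linear score, with loss = -sum(weight * feature * advantage),
    so d(loss)/d(weight) = -sum(feature * advantage) per pair."""
    grad, tokens = 0.0, 0
    n = len(mini_batches)
    for i, batch in enumerate(mini_batches):
        if i % log_every == 0:
            # Mirrors "Processing mini-batch i of n" in the log
            print(f"Processing mini-batch {i} of {n}")
        for feature, advantage in batch:
            grad += -(feature * advantage) / n  # scale so grads average over batches
            tokens += 1
    print(f"Accumulated the policy gradient loss for {tokens} tokens.")
    return weight - lr * grad, tokens  # single "reinforce step"

new_w, n_tokens = accumulate_and_step(0.0, [[(1.0, 1.0), (1.0, 1.0)], [(1.0, 1.0)]])
```

The point of the pattern is memory, not math: one logical batch of 128 mini-batches produces exactly one optimizer step, which matches the constant "ΔTime: 00:00:11" and the stable ~26% peak VRAM per "Apply reinforce step" block above.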
[2025-11-13 09:46:15,069][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:46:24,084][__main__][INFO] - Number of regex retries in iteration 280: 0 [2025-11-13 09:46:24,085][__main__][INFO] - agents played in iteration 280 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 09:46:24,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:46:24,555][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:46:24,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:46:24,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:46:24,623][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:46:24,623][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:46:25,354][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:46:25,652][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:46:25,979][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:46:26,307][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:46:26,633][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:46:26,962][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:46:27,296][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:46:27,626][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:46:27,956][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:46:28,282][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:46:28,609][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:46:28,935][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:46:29,263][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:46:29,590][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:46:29,917][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:46:30,243][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:46:30,569][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:46:30,895][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:46:31,222][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:46:31,550][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:46:31,878][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:46:32,205][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:46:32,532][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:46:32,862][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:46:33,192][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:46:33,519][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:46:33,848][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:46:34,176][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:46:34,503][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:46:34,829][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:46:35,156][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:46:35,482][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:46:35,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:46:36,528][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:46:37,223][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:46:37,225][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:46:37,226][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:46:38,974][__main__][INFO] - Iteration 281 took 23s (37.71% Gen, 54.97% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 12m 29s. Estimated total time: 19h 55m 19s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 50s, 500 more iterations: 3h 19m 13s.
[2025-11-13 09:46:38,976][__main__][INFO] - Starting iteration 281.
[2025-11-13 09:46:38,979][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:46:38,980][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:46:47,616][__main__][INFO] - Number of regex retries in iteration 281: 0
[2025-11-13 09:46:47,616][__main__][INFO] - agents played in iteration 281 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:46:48,050][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:46:48,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:46:48,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:46:48,149][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:46:48,149][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:46:48,150][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:46:48,841][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:46:49,137][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:46:49,468][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:46:49,799][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:46:50,130][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:46:50,458][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:46:50,786][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:46:51,113][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:46:51,442][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:46:51,773][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:46:52,101][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:46:52,429][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:46:52,757][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:46:53,085][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:46:53,412][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:46:53,740][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:46:54,069][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:46:54,395][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:46:54,722][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:46:55,049][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:46:55,375][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:46:55,701][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:46:56,028][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:46:56,356][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:46:56,684][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:46:57,011][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:46:57,338][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:46:57,664][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:46:57,992][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:46:58,318][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:46:58,644][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:46:58,972][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:46:59,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:47:00,034][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:47:00,729][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:47:00,730][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:47:00,732][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:47:01,644][__main__][INFO] - Iteration 282 took 22s (38.10% Gen, 57.86% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 10m 5s. Estimated total time: 18h 53m 17s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 46s, 500 more iterations: 3h 8m 52s.
[2025-11-13 09:47:01,646][__main__][INFO] - Starting iteration 282.
[2025-11-13 09:47:01,649][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:47:01,649][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:47:10,815][__main__][INFO] - Number of regex retries in iteration 282: 0
[2025-11-13 09:47:10,815][__main__][INFO] - agents played in iteration 282 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:47:11,254][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:11,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:11,321][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:11,354][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:11,354][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:47:11,355][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:47:12,038][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:47:12,334][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:47:12,662][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:47:12,988][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:47:13,320][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:47:13,653][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:47:13,980][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:47:14,306][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:47:14,637][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:47:14,969][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:47:15,297][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:47:15,627][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:47:15,956][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:47:16,284][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:47:16,610][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:47:16,936][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:47:17,262][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:47:17,589][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:47:17,915][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:47:18,242][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:47:18,570][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:47:18,900][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:47:19,228][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:47:19,556][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:47:19,884][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:47:20,211][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:47:20,539][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:47:20,868][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:47:21,194][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:47:21,521][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:47:21,848][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:47:22,175][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:47:22,501][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:47:23,217][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:47:23,896][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:47:23,897][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:47:23,899][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:47:24,800][__main__][INFO] - Iteration 283 took 23s (39.59% Gen, 56.51% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 34m 2s. Estimated total time: 19h 17m 37s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 35s, 500 more iterations: 3h 12m 56s.
[2025-11-13 09:47:24,802][__main__][INFO] - Starting iteration 283.
[2025-11-13 09:47:24,805][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:47:24,805][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:47:33,628][__main__][INFO] - Number of regex retries in iteration 283: 0
[2025-11-13 09:47:33,629][__main__][INFO] - agents played in iteration 283 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:47:34,067][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:34,100][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:34,132][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:34,165][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:34,166][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:47:34,166][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:47:34,846][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:47:35,143][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:47:35,471][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:47:35,796][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:47:36,121][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:47:36,448][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:47:36,776][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:47:37,103][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:47:37,430][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:47:37,757][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:47:38,084][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:47:38,411][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:47:38,741][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:47:39,073][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:47:39,400][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:47:39,726][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:47:40,053][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:47:40,379][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:47:40,705][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:47:41,032][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:47:41,358][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:47:41,687][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:47:42,014][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:47:42,340][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:47:42,668][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:47:42,996][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:47:43,323][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:47:43,650][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:47:43,977][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:47:44,303][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:47:44,630][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:47:44,957][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:47:45,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:47:45,994][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:47:46,693][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:47:46,694][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:47:46,696][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:47:47,609][__main__][INFO] - Iteration 284 took 22s (38.69% Gen, 57.30% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 16m 16s. Estimated total time: 19h 0m 15s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 0s, 500 more iterations: 3h 10m 2s.
[2025-11-13 09:47:47,611][__main__][INFO] - Starting iteration 284.
[2025-11-13 09:47:47,614][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:47:47,615][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:47:56,827][__main__][INFO] - Number of regex retries in iteration 284: 0
[2025-11-13 09:47:56,827][__main__][INFO] - agents played in iteration 284 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:47:57,270][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:57,303][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:57,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:57,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:57,368][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:47:57,369][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:47:58,050][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:47:58,347][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:47:58,674][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:47:59,000][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:47:59,326][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:47:59,651][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:47:59,978][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:48:00,305][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:48:00,633][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:48:00,960][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:48:01,288][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:48:01,615][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:48:01,941][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:48:02,275][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:48:02,606][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:48:02,935][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:48:03,262][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:48:03,590][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:48:03,918][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:48:04,244][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:48:04,572][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:48:04,899][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:48:05,226][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:48:05,555][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:48:05,882][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:48:06,211][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:48:06,539][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:48:06,868][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:48:07,196][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:48:07,523][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:48:07,852][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:48:08,178][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:48:08,505][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:48:09,215][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:48:09,898][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:48:09,900][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:48:09,902][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:48:10,845][__main__][INFO] - Iteration 285 took 23s (39.65% Gen, 56.28% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 37m 13s. Estimated total time: 19h 21m 34s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 43s, 500 more iterations: 3h 13m 35s.
[2025-11-13 09:48:10,847][__main__][INFO] - Starting iteration 285.
[2025-11-13 09:48:10,850][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:48:10,850][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:48:20,115][__main__][INFO] - Number of regex retries in iteration 285: 0
[2025-11-13 09:48:20,115][__main__][INFO] - agents played in iteration 285 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:48:20,549][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:20,584][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:20,616][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:20,649][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:20,649][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:48:20,650][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:48:21,341][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:48:21,639][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:48:21,965][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:48:22,291][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:48:22,618][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:48:22,944][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:48:23,270][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:48:23,595][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:48:23,925][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:48:24,252][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:48:24,580][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:48:24,907][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:48:25,233][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:48:25,560][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:48:25,890][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:48:26,218][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:48:26,549][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:48:26,876][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:48:27,204][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:48:27,532][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:48:27,860][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:48:28,187][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:48:28,515][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:48:28,843][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:48:29,172][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:48:29,500][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:48:29,827][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:48:30,154][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:48:30,481][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:48:30,808][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:48:31,136][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:48:31,463][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:48:31,791][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:48:32,502][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:48:33,204][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:48:33,206][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:48:33,208][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:48:34,134][__main__][INFO] - Iteration 286 took 23s (39.79% Gen, 56.23% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 39m 29s. Estimated total time: 19h 24m 14s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 48s, 500 more iterations: 3h 14m 2s. [2025-11-13 09:48:34,136][__main__][INFO] - Starting iteration 286. [2025-11-13 09:48:34,139][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1. 
[2025-11-13 09:48:34,140][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:48:43,168][__main__][INFO] - Number of regex retries in iteration 286: 0
[2025-11-13 09:48:43,168][__main__][INFO] - agents played in iteration 286 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:48:43,614][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:43,647][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:43,680][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:43,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:43,713][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:48:43,714][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:48:44,397][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:48:44,694][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:48:45,020][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:48:45,346][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:48:45,673][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:48:46,000][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:48:46,326][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:48:46,652][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:48:46,981][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:48:47,306][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:48:47,634][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:48:47,959][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:48:48,289][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:48:48,618][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:48:48,949][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:48:49,279][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:48:49,609][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:48:49,936][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:48:50,264][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:48:50,591][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:48:50,918][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:48:51,245][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:48:51,572][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:48:51,898][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:48:52,225][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:48:52,552][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:48:52,880][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:48:53,206][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:48:53,534][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:48:53,860][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:48:54,188][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:48:54,515][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:48:54,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:48:55,560][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:48:56,246][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:48:56,247][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:48:56,249][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:48:57,145][__main__][INFO] - Iteration 287 took 23s (39.24% Gen, 56.86% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 25m 13s. Estimated total time: 19h 10m 21s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 20s, 500 more iterations: 3h 11m 43s.
[2025-11-13 09:48:57,147][__main__][INFO] - Starting iteration 287.
[2025-11-13 09:48:57,150][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:48:57,150][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:49:06,168][__main__][INFO] - Number of regex retries in iteration 287: 0
[2025-11-13 09:49:06,168][__main__][INFO] - agents played in iteration 287 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:49:06,618][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:06,651][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:06,683][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:06,715][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:06,716][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:49:06,716][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:49:07,390][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:49:07,686][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:49:08,013][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:49:08,338][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:49:08,664][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:49:08,989][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:49:09,314][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:49:09,641][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:49:09,971][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:49:10,296][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:49:10,622][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:49:10,949][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:49:11,276][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:49:11,604][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:49:11,937][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:49:12,270][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:49:12,601][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:49:12,929][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:49:13,256][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:49:13,585][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:49:13,913][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:49:14,240][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:49:14,568][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:49:14,894][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:49:15,222][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:49:15,549][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:49:15,877][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:49:16,204][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:49:16,533][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:49:16,859][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:49:17,188][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:49:17,516][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:49:17,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:49:18,560][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:49:19,253][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:49:19,255][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:49:19,257][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:49:20,143][__main__][INFO] - Iteration 288 took 22s (39.22% Gen, 56.92% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 24m 10s. Estimated total time: 19h 9m 41s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 19s, 500 more iterations: 3h 11m 36s.
[2025-11-13 09:49:20,145][__main__][INFO] - Starting iteration 288.
[2025-11-13 09:49:20,148][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:49:20,148][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:49:29,417][__main__][INFO] - Number of regex retries in iteration 288: 0
[2025-11-13 09:49:29,418][__main__][INFO] - agents played in iteration 288 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:49:29,853][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:29,888][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:29,920][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:29,953][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:29,954][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:49:29,954][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:49:30,641][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:49:30,936][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:49:31,263][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:49:31,589][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:49:31,916][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:49:32,241][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:49:32,569][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:49:32,895][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:49:33,220][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:49:33,548][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:49:33,873][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:49:34,199][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:49:34,524][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:49:34,850][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:49:35,179][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:49:35,505][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:49:35,832][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:49:36,161][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:49:36,493][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:49:36,824][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:49:37,151][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:49:37,481][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:49:37,807][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:49:38,133][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:49:38,461][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:49:38,789][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:49:39,115][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:49:39,443][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:49:39,770][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:49:40,097][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:49:40,423][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:49:40,750][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:49:41,077][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:49:41,799][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:49:42,483][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:49:42,484][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:49:42,485][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:49:43,360][__main__][INFO] - Iteration 289 took 23s (39.93% Gen, 56.29% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 34m 45s. Estimated total time: 19h 20m 39s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 41s, 500 more iterations: 3h 13m 26s.
[2025-11-13 09:49:43,362][__main__][INFO] - Starting iteration 289.
[2025-11-13 09:49:43,365][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:49:43,365][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:49:52,320][__main__][INFO] - Number of regex retries in iteration 289: 0
[2025-11-13 09:49:52,321][__main__][INFO] - agents played in iteration 289 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:49:52,756][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:52,788][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:52,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:52,854][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:52,854][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:49:52,855][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:49:53,532][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:49:53,829][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:49:54,152][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:49:54,477][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:49:54,805][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:49:55,134][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:49:55,464][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:49:55,796][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:49:56,127][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:49:56,455][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:49:56,782][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:49:57,107][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:49:57,437][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:49:57,762][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:49:58,089][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:49:58,415][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:49:58,744][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:49:59,070][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:49:59,397][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:49:59,724][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:50:00,053][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:50:00,384][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:50:00,713][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:50:01,044][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:50:01,374][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:50:01,706][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:50:02,035][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:50:02,362][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:50:02,690][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:50:03,018][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:50:03,347][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:50:03,675][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:50:04,002][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:50:04,734][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:50:05,420][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:50:05,421][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:50:05,423][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:50:06,322][__main__][INFO] - Iteration 290 took 22s (39.01% Gen, 57.07% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 21m 37s. Estimated total time: 19h 7m 54s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 15s, 500 more iterations: 3h 11m 19s.
[2025-11-13 09:50:06,324][__main__][INFO] - Starting iteration 290.
[2025-11-13 09:50:06,327][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:50:06,327][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:50:15,632][__main__][INFO] - Number of regex retries in iteration 290: 0
[2025-11-13 09:50:15,633][__main__][INFO] - agents played in iteration 290 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:50:16,088][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:50:16,120][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:50:16,153][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:50:16,186][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:50:16,187][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:50:16,187][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:50:16,902][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:50:17,199][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:50:17,524][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:50:17,852][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:50:18,177][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:50:18,501][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:50:18,827][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:50:19,154][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:50:19,481][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:50:19,807][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:50:20,133][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:50:20,458][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:50:20,785][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:50:21,110][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:50:21,438][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:50:21,769][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:50:22,095][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:50:22,421][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:50:22,748][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:50:23,076][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:50:23,405][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:50:23,732][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:50:24,059][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:50:24,387][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:50:24,714][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:50:25,039][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:50:25,370][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:50:25,697][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:50:26,023][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:50:26,350][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:50:26,678][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:50:27,005][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:50:27,330][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:50:28,051][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:50:28,739][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:50:28,741][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:50:28,742][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:50:30,477][__main__][INFO] - Iteration 291 took 24s (38.53% Gen, 54.28% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 20m 53s. Estimated total time: 20h 7m 35s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 15s, 500 more iterations: 3h 21m 15s.
[2025-11-13 09:50:30,479][__main__][INFO] - Starting iteration 291.
[2025-11-13 09:50:30,482][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
[2025-11-13 09:50:30,482][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:50:39,877][__main__][INFO] - Number of regex retries in iteration 291: 0
[2025-11-13 09:50:39,877][__main__][INFO] - agents played in iteration 291 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:50:40,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:50:40,372][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:50:40,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:50:40,439][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:50:40,440][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:50:40,440][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:50:41,152][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:50:41,449][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:50:41,775][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:50:42,100][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:50:42,425][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:50:42,752][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:50:43,077][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:50:43,404][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:50:43,730][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:50:44,057][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:50:44,385][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:50:44,713][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:50:45,039][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:50:45,366][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:50:45,695][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:50:46,020][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:50:46,350][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:50:46,681][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:50:47,013][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:50:47,342][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:50:47,670][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:50:47,997][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:50:48,325][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:50:48,651][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:50:48,979][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:50:49,309][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:50:49,638][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:50:49,967][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:50:50,296][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:50:50,622][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:50:50,949][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:50:51,274][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:50:51,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:50:52,321][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:50:53,022][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:50:53,024][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:50:53,025][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:50:53,908][__main__][INFO] - Iteration 292 took 23s (40.10% Gen, 56.12% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 44m 15s. Estimated total time: 19h 31m 20s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 2s, 500 more iterations: 3h 15m 13s.
[2025-11-13 09:50:53,910][__main__][INFO] - Starting iteration 292.
[2025-11-13 09:50:53,913][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
[2025-11-13 09:50:53,913][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:51:04,573][__main__][INFO] - Number of regex retries in iteration 292: 0
[2025-11-13 09:51:04,574][__main__][INFO] - agents played in iteration 292 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:51:05,026][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:05,060][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:05,092][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:05,125][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:05,126][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:51:05,126][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:51:05,817][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:51:06,112][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:51:06,440][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:51:06,765][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:51:07,091][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:51:07,417][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:51:07,744][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:51:08,070][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:51:08,397][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:51:08,723][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:51:09,051][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:51:09,376][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:51:09,701][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:51:10,028][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:51:10,354][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:51:10,680][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:51:11,006][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:51:11,334][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:51:11,660][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:51:11,988][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:51:12,315][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:51:12,644][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:51:12,974][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:51:13,300][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:51:13,627][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:51:13,959][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:51:14,286][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:51:14,612][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:51:14,941][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:51:15,267][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:51:15,594][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:51:15,921][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:51:16,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:51:16,975][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:51:17,681][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:51:17,682][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:51:17,684][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:51:18,566][__main__][INFO] - Iteration 293 took 24s (43.24% Gen, 53.18% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 45m 13s. Estimated total time: 20h 32m 42s. Time estimates for 10 more iterations: 4m 6s, 100 more iterations: 41m 5s, 500 more iterations: 3h 25m 27s.
[2025-11-13 09:51:18,568][__main__][INFO] - Starting iteration 293.
[2025-11-13 09:51:18,571][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
[2025-11-13 09:51:18,571][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:51:28,206][__main__][INFO] - Number of regex retries in iteration 293: 0
[2025-11-13 09:51:28,207][__main__][INFO] - agents played in iteration 293 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:51:28,670][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:28,703][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:28,735][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:28,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:28,769][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:51:28,769][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:51:29,474][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:51:29,771][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:51:30,099][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:51:30,426][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:51:30,754][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:51:31,081][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:51:31,409][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:51:31,740][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:51:32,067][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:51:32,393][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:51:32,718][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:51:33,042][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:51:33,369][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:51:33,696][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:51:34,021][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:51:34,348][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:51:34,675][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:51:35,001][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:51:35,327][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:51:35,654][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:51:35,982][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:51:36,310][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:51:36,638][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:51:36,966][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:51:37,296][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:51:37,625][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:51:37,952][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:51:38,277][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:51:38,603][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:51:38,929][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:51:39,257][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:51:39,582][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:51:39,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:51:40,633][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:51:41,323][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:51:41,325][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:51:41,326][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:51:42,208][__main__][INFO] - Iteration 294 took 23s (40.76% Gen, 55.50% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 54m 1s. Estimated total time: 19h 41m 54s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 23s, 500 more iterations: 3h 16m 59s.
[2025-11-13 09:51:42,210][__main__][INFO] - Starting iteration 294.
[2025-11-13 09:51:42,270][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
[2025-11-13 09:51:42,271][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:51:51,370][__main__][INFO] - Number of regex retries in iteration 294: 0
[2025-11-13 09:51:51,371][__main__][INFO] - agents played in iteration 294 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:51:51,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:51,857][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:51,890][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:51,924][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:51,924][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:51:51,925][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:51:52,643][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:51:52,941][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:51:53,270][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:51:53,595][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:51:53,923][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:51:54,249][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:51:54,574][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:51:54,900][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:51:55,227][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:51:55,554][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:51:55,879][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:51:56,206][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:51:56,532][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:51:56,858][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:51:57,184][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:51:57,510][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:51:57,837][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:51:58,162][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:51:58,488][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:51:58,816][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:51:59,142][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:51:59,471][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:51:59,799][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:52:00,124][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:52:00,451][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:52:00,779][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:52:01,107][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:52:01,434][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:52:01,761][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:52:02,088][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:52:02,414][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:52:02,740][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:52:03,068][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:52:03,806][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:52:04,493][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:52:04,495][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:52:04,496][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:52:05,413][__main__][INFO] - Iteration 295 took 23s (39.22% Gen, 56.57% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 31m 49s. Estimated total time: 19h 20m 6s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 40s, 500 more iterations: 3h 13m 21s.
[2025-11-13 09:52:05,415][__main__][INFO] - Starting iteration 295.
[2025-11-13 09:52:05,418][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
[2025-11-13 09:52:05,419][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:52:14,831][__main__][INFO] - Number of regex retries in iteration 295: 0
[2025-11-13 09:52:14,832][__main__][INFO] - agents played in iteration 295 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:52:15,269][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:15,302][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:15,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:15,367][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:15,368][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:52:15,368][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:52:16,084][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:52:16,382][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:52:16,709][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:52:17,036][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:52:17,365][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:52:17,690][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:52:18,015][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:52:18,340][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:52:18,666][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:52:18,991][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:52:19,318][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:52:19,645][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:52:19,974][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:52:20,301][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:52:20,629][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:52:20,954][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:52:21,281][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:52:21,608][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:52:21,934][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:52:22,261][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:52:22,587][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:52:22,914][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:52:23,242][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:52:23,571][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:52:23,902][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:52:24,234][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:52:24,564][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:52:24,890][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:52:25,216][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:52:25,542][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:52:25,867][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:52:26,195][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:52:26,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:52:27,255][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:52:27,942][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:52:27,944][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:52:27,946][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:52:28,819][__main__][INFO] - Iteration 296 took 23s (40.22% Gen, 56.04% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 41m 24s. Estimated total time: 19h 30m 4s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 0s, 500 more iterations: 3h 15m 0s.
[2025-11-13 09:52:28,821][__main__][INFO] - Starting iteration 296.
[2025-11-13 09:52:28,824][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
[2025-11-13 09:52:28,824][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:52:38,692][__main__][INFO] - Number of regex retries in iteration 296: 0
[2025-11-13 09:52:38,692][__main__][INFO] - agents played in iteration 296 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:52:39,141][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:39,177][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:39,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:39,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:39,243][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:52:39,244][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:52:39,980][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:52:40,279][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:52:40,607][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:52:40,936][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:52:41,261][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:52:41,588][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:52:41,916][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:52:42,242][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:52:42,569][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:52:42,896][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:52:43,226][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:52:43,551][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:52:43,877][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:52:44,202][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:52:44,530][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:52:44,855][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:52:45,181][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:52:45,510][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:52:45,836][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:52:46,163][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:52:46,489][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:52:46,820][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:52:47,152][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:52:47,482][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:52:47,812][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:52:48,139][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:52:48,467][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:52:48,793][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:52:49,118][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:52:49,445][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:52:49,771][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:52:50,098][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:52:50,424][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:52:51,139][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:52:51,830][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:52:51,831][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:52:51,833][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:52:52,713][__main__][INFO] - Iteration 297 took 23s (41.31% Gen, 55.01% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 5m 28s. Estimated total time: 19h 54m 32s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 49s, 500 more iterations: 3h 19m 5s.
[2025-11-13 09:52:52,715][__main__][INFO] - Starting iteration 297.
[2025-11-13 09:52:52,719][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
[2025-11-13 09:52:52,719][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:53:02,586][__main__][INFO] - Number of regex retries in iteration 297: 0
[2025-11-13 09:53:02,586][__main__][INFO] - agents played in iteration 297 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:53:03,034][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:53:03,067][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:53:03,101][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:53:03,135][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:53:03,136][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:53:03,136][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:53:03,856][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:53:04,154][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:53:04,480][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:53:04,806][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:53:05,134][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:53:05,464][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:53:05,790][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:53:06,118][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:53:06,445][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:53:06,771][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:53:07,097][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:53:07,424][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:53:07,749][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:53:08,075][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:53:08,404][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:53:08,729][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:53:09,053][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:53:09,380][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:53:09,706][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:53:10,033][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:53:10,361][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:53:10,687][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:53:11,013][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:53:11,341][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:53:11,669][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:53:11,999][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:53:12,326][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:53:12,652][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:53:12,978][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:53:13,303][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:53:13,630][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:53:13,956][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:53:14,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:53:15,008][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:53:15,715][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:53:15,717][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:53:15,719][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:53:16,604][__main__][INFO] - Iteration 298 took 23s (41.31% Gen, 54.98% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 4m 52s. Estimated total time: 19h 54m 19s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 48s, 500 more iterations: 3h 19m 3s.
[2025-11-13 09:53:16,607][__main__][INFO] - Starting iteration 298.
[2025-11-13 09:53:16,610][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
[2025-11-13 09:53:16,610][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:53:26,334][__main__][INFO] - Number of regex retries in iteration 298: 0
[2025-11-13 09:53:26,335][__main__][INFO] - agents played in iteration 298 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:53:26,785][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:53:26,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:53:26,851][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:53:26,885][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:53:26,885][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:53:26,886][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:53:27,582][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:53:27,878][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:53:28,205][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:53:28,532][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:53:28,860][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:53:29,190][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:53:29,516][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:53:29,844][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:53:30,170][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:53:30,495][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:53:30,821][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:53:31,145][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:53:31,472][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:53:31,801][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:53:32,132][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:53:32,458][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:53:32,787][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:53:33,116][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:53:33,443][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:53:33,774][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:53:34,101][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:53:34,429][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:53:34,758][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:53:35,084][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:53:35,411][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:53:35,737][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:53:36,064][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:53:36,390][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:53:36,718][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:53:37,045][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:53:37,372][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:53:37,698][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:53:38,024][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:53:38,747][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:53:39,448][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:53:39,449][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:53:39,451][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:53:40,368][__main__][INFO] - Iteration 299 took 23s (40.93% Gen, 55.20% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 58m 6s. Estimated total time: 19h 47m 57s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 35s, 500 more iterations: 3h 17m 59s.
[2025-11-13 09:53:40,370][__main__][INFO] - Starting iteration 299.
[2025-11-13 09:53:40,372][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
[2025-11-13 09:53:40,373][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:53:50,340][__main__][INFO] - Number of regex retries in iteration 299: 0
[2025-11-13 09:53:50,340][__main__][INFO] - agents played in iteration 299 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:53:50,783][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:53:50,817][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:53:50,849][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:53:50,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:53:50,882][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:53:50,883][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:53:51,562][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:53:51,858][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:53:52,186][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:53:52,512][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:53:52,838][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:53:53,164][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:53:53,490][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:53:53,816][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:53:54,142][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:53:54,467][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:53:54,793][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:53:55,120][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:53:55,446][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:53:55,772][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:53:56,100][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:53:56,427][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:53:56,753][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:53:57,080][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:53:57,410][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:53:57,736][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:53:58,063][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:53:58,394][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:53:58,725][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:53:59,051][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:53:59,382][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:53:59,708][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:54:00,035][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:54:00,361][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:54:00,689][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:54:01,015][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:54:01,343][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:54:01,670][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:54:01,998][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:54:02,729][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:54:03,424][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:54:03,425][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:54:03,434][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:54:04,340][__main__][INFO] - Iteration 300 took 23s (41.58% Gen, 54.63% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 8m 9s. Estimated total time: 19h 58m 24s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 56s, 500 more iterations: 3h 19m 44s.
[2025-11-13 09:54:04,342][__main__][INFO] - Starting iteration 300.
[2025-11-13 09:54:04,344][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
[2025-11-13 09:54:04,345][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:54:13,840][__main__][INFO] - Number of regex retries in iteration 300: 0
[2025-11-13 09:54:13,841][__main__][INFO] - agents played in iteration 300 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:54:14,292][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:54:14,326][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:54:14,359][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:54:14,392][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:54:14,393][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:54:14,394][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:54:15,102][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:54:15,398][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:54:15,728][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:54:16,056][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:54:16,385][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:54:16,712][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:54:17,039][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:54:17,367][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:54:17,698][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:54:18,025][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:54:18,354][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:54:18,684][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:54:19,011][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:54:19,338][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:54:19,668][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:54:19,994][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:54:20,320][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:54:20,647][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:54:20,973][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:54:21,302][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:54:21,628][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:54:21,960][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:54:22,289][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:54:22,615][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:54:22,941][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:54:23,268][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:54:23,596][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:54:23,925][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:54:24,253][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:54:24,579][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:54:24,906][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:54:25,233][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:54:25,559][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:54:26,290][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:54:26,994][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:54:27,114][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:54:27,116][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:54:28,860][__main__][INFO] - Iteration 301 took 24s (38.73% Gen, 54.15% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 35m 11s. Estimated total time: 20h 25m 50s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 51s, 500 more iterations: 3h 24m 18s.
[2025-11-13 09:54:28,863][__main__][INFO] - Starting iteration 301.
[2025-11-13 09:54:28,865][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:54:28,865][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:54:38,371][__main__][INFO] - Number of regex retries in iteration 301: 0
[2025-11-13 09:54:38,371][__main__][INFO] - agents played in iteration 301 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:54:38,825][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:54:38,859][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:54:38,892][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:54:38,925][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:54:38,926][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:54:38,926][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:54:39,608][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:54:39,905][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:54:40,234][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:54:40,566][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:54:40,897][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:54:41,225][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:54:41,555][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:54:41,882][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:54:42,213][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:54:42,542][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:54:42,869][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:54:43,199][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:54:43,533][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:54:43,856][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:54:44,184][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:54:44,510][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:54:44,837][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:54:45,163][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:54:45,490][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:54:45,818][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:54:46,147][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:54:46,478][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:54:46,810][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:54:47,138][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:54:47,472][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:54:47,803][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:54:48,131][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:54:48,458][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:54:48,786][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:54:49,114][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:54:49,440][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:54:49,767][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:54:50,095][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:54:50,839][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:54:51,519][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:54:51,520][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:54:51,522][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:54:52,414][__main__][INFO] - Iteration 302 took 23s (40.36% Gen, 55.84% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 46m 27s. Estimated total time: 19h 37m 31s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 15s, 500 more iterations: 3h 16m 15s.
[2025-11-13 09:54:52,416][__main__][INFO] - Starting iteration 302.
[2025-11-13 09:54:52,419][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:54:52,420][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:55:01,887][__main__][INFO] - Number of regex retries in iteration 302: 0
[2025-11-13 09:55:01,888][__main__][INFO] - agents played in iteration 302 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:55:02,343][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:02,375][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:02,408][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:02,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:02,442][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:55:02,442][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:55:03,174][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:55:03,470][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:55:03,797][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:55:04,123][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:55:04,449][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:55:04,777][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:55:05,102][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:55:05,429][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:55:05,758][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:55:06,088][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:55:06,415][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:55:06,746][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:55:07,073][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:55:07,400][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:55:07,727][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:55:08,055][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:55:08,385][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:55:08,713][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:55:09,045][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:55:09,374][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:55:09,703][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:55:10,035][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:55:10,363][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:55:10,695][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:55:11,024][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:55:11,353][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:55:11,686][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:55:12,012][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:55:12,338][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:55:12,668][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:55:12,996][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:55:13,322][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:55:13,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:55:14,394][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:55:15,099][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:55:15,100][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:55:15,102][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:55:16,002][__main__][INFO] - Iteration 303 took 23s (40.15% Gen, 56.03% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 47m 43s. Estimated total time: 19h 39m 9s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 18s, 500 more iterations: 3h 16m 31s.
[2025-11-13 09:55:16,004][__main__][INFO] - Starting iteration 303.
[2025-11-13 09:55:16,007][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:55:16,008][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:55:25,077][__main__][INFO] - Number of regex retries in iteration 303: 0
[2025-11-13 09:55:25,077][__main__][INFO] - agents played in iteration 303 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:55:25,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:25,556][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:25,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:25,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:25,623][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:55:25,623][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:55:26,333][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:55:26,629][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:55:26,956][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:55:27,282][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:55:27,608][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:55:27,934][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:55:28,261][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:55:28,588][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:55:28,918][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:55:29,246][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:55:29,570][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:55:29,895][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:55:30,221][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:55:30,546][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:55:30,875][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:55:31,200][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:55:31,531][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:55:31,864][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:55:32,191][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:55:32,518][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:55:32,847][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:55:33,174][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:55:33,502][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:55:33,829][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:55:34,156][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:55:34,484][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:55:34,811][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:55:35,139][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:55:35,467][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:55:35,795][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:55:36,121][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:55:36,448][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:55:36,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:55:37,496][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:55:38,189][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:55:38,191][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:55:38,193][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:55:39,066][__main__][INFO] - Iteration 304 took 23s (39.33% Gen, 56.88% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 21m 10s. Estimated total time: 19h 13m 0s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 26s, 500 more iterations: 3h 12m 10s.
[2025-11-13 09:55:39,068][__main__][INFO] - Starting iteration 304.
[2025-11-13 09:55:39,071][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:55:39,072][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:55:45,907][mllm.models.large_language_model_local][WARNING] - Response %A did not match regex: (|), retry 1/1
[2025-11-13 09:55:48,675][__main__][INFO] - Number of regex retries in iteration 304: 1
[2025-11-13 09:55:48,676][__main__][INFO] - agents played in iteration 304 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:55:49,129][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:49,164][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:49,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:49,232][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:49,233][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:55:49,233][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:55:49,944][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:55:50,241][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:55:50,572][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:55:50,898][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:55:51,224][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:55:51,550][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:55:51,881][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:55:52,208][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:55:52,534][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:55:52,860][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:55:53,186][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:55:53,511][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:55:53,837][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:55:54,164][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:55:54,491][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:55:54,816][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:55:55,144][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:55:55,471][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:55:55,798][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:55:56,128][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:55:56,456][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:55:56,786][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:55:57,114][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:55:57,441][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:55:57,768][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:55:58,100][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:55:58,427][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:55:58,753][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:55:59,079][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:55:59,405][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:55:59,732][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:56:00,059][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:56:00,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:56:01,113][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:56:01,817][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:56:01,818][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:56:01,820][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:56:02,714][__main__][INFO] - Iteration 305 took 23s (40.62% Gen, 55.59% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 49m 58s. Estimated total time: 19h 42m 12s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 24s, 500 more iterations: 3h 17m 2s.
[2025-11-13 09:56:02,717][__main__][INFO] - Starting iteration 305.
[2025-11-13 09:56:02,720][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:56:02,720][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:56:12,601][__main__][INFO] - Number of regex retries in iteration 305: 0
[2025-11-13 09:56:12,601][__main__][INFO] - agents played in iteration 305 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:56:13,055][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:13,090][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:13,123][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:13,157][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:13,157][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:56:13,158][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:56:13,874][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:56:14,172][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:56:14,500][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:56:14,826][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:56:15,153][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:56:15,478][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:56:15,803][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:56:16,129][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:56:16,456][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:56:16,782][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:56:17,110][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:56:17,437][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:56:17,768][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:56:18,100][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:56:18,428][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:56:18,755][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:56:19,082][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:56:19,411][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:56:19,738][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:56:20,064][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:56:20,392][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:56:20,721][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:56:21,049][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:56:21,377][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:56:21,703][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:56:22,029][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:56:22,356][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:56:22,686][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:56:23,015][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:56:23,342][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:56:23,671][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:56:23,997][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:56:24,323][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:56:25,038][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:56:25,749][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:56:25,750][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:56:25,752][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:56:26,662][__main__][INFO] - Iteration 306 took 23s (41.27% Gen, 54.93% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 4m 31s. Estimated total time: 19h 57m 8s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 54s, 500 more iterations: 3h 19m 31s.
[2025-11-13 09:56:26,664][__main__][INFO] - Starting iteration 306.
[2025-11-13 09:56:26,667][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:56:26,667][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:56:36,237][__main__][INFO] - Number of regex retries in iteration 306: 0
[2025-11-13 09:56:36,238][__main__][INFO] - agents played in iteration 306 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:56:36,693][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:36,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:36,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:36,794][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:36,794][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:56:36,795][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:56:37,506][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:56:37,803][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:56:38,132][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:56:38,457][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:56:38,783][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:56:39,108][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:56:39,434][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:56:39,759][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:56:40,088][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:56:40,418][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:56:40,744][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:56:41,074][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:56:41,403][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:56:41,733][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:56:42,060][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:56:42,392][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:56:42,719][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:56:43,044][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:56:43,371][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:56:43,697][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:56:44,022][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:56:44,349][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:56:44,677][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:56:45,004][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:56:45,330][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:56:45,658][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:56:45,986][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:56:46,312][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:56:46,639][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:56:46,966][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:56:47,292][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:56:47,619][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:56:47,947][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:56:48,659][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:56:49,370][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:56:49,372][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:56:49,373][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:56:50,269][__main__][INFO] - Iteration 307 took 23s (40.55% Gen, 55.65% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 47m 9s. Estimated total time: 19h 40m 10s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 20s, 500 more iterations: 3h 16m 41s.
[2025-11-13 09:56:50,271][__main__][INFO] - Starting iteration 307.
[2025-11-13 09:56:50,275][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:56:50,275][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:57:00,171][__main__][INFO] - Number of regex retries in iteration 307: 0
[2025-11-13 09:57:00,172][__main__][INFO] - agents played in iteration 307 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:57:00,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:00,654][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:00,687][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:00,721][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:00,722][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:57:00,722][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:57:01,439][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:57:01,735][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:57:02,061][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:57:02,388][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:57:02,714][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:57:03,041][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:57:03,369][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:57:03,698][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:57:04,026][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:57:04,353][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:57:04,681][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:57:05,007][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:57:05,333][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:57:05,660][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:57:05,989][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:57:06,315][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:57:06,641][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:57:06,969][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:57:07,296][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:57:07,623][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:57:07,951][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:57:08,278][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:57:08,606][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:57:08,933][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:57:09,265][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:57:09,593][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:57:09,922][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:57:10,249][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:57:10,576][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:57:10,902][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:57:11,230][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:57:11,556][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:57:11,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:57:12,603][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:57:13,318][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:57:13,320][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:57:13,322][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:57:14,227][__main__][INFO] - Iteration 308 took 23s (41.32% Gen, 54.90% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 4m 14s. Estimated total time: 19h 57m 39s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 55s, 500 more iterations: 3h 19m 36s.
[2025-11-13 09:57:14,229][__main__][INFO] - Starting iteration 308.
[2025-11-13 09:57:14,232][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:57:14,233][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:57:23,846][__main__][INFO] - Number of regex retries in iteration 308: 0
[2025-11-13 09:57:23,846][__main__][INFO] - agents played in iteration 308 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:57:24,299][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:24,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:24,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:24,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:24,402][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:57:24,403][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:57:25,121][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:57:25,417][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:57:25,748][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:57:26,075][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:57:26,403][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:57:26,730][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:57:27,058][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:57:27,383][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:57:27,711][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:57:28,038][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:57:28,364][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:57:28,692][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:57:29,024][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:57:29,352][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:57:29,679][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:57:30,006][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:57:30,333][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:57:30,661][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:57:30,989][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:57:31,315][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:57:31,641][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:57:31,968][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:57:32,294][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:57:32,621][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:57:32,948][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:57:33,275][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:57:33,601][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:57:33,930][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:57:34,256][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:57:34,583][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:57:34,912][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:57:35,241][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:57:35,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:57:36,306][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:57:37,017][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:57:37,019][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:57:37,021][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:57:37,908][__main__][INFO] - Iteration 309 took 23s (40.60% Gen, 55.64% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 50m 2s. Estimated total time: 19h 43m 51s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 27s, 500 more iterations: 3h 17m 18s.
[2025-11-13 09:57:37,911][__main__][INFO] - Starting iteration 309.
[2025-11-13 09:57:37,913][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:57:37,914][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:57:47,138][__main__][INFO] - Number of regex retries in iteration 309: 0
[2025-11-13 09:57:47,139][__main__][INFO] - agents played in iteration 309 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:57:47,587][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:47,631][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:47,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:47,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:47,702][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:57:47,702][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:57:48,419][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:57:48,714][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:57:49,043][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:57:49,370][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:57:49,697][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:57:50,023][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:57:50,351][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:57:50,679][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:57:51,005][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:57:51,332][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:57:51,659][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:57:51,986][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:57:52,314][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:57:52,641][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:57:52,970][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:57:53,295][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:57:53,621][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:57:53,948][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:57:54,274][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:57:54,601][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:57:54,929][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:57:55,255][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:57:55,582][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:57:55,909][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:57:56,236][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:57:56,564][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:57:56,896][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:57:57,223][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:57:57,550][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:57:57,876][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:57:58,205][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:57:58,535][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:57:58,861][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:57:59,576][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:58:00,279][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:58:00,280][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:58:00,282][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:58:01,208][__main__][INFO] - Iteration 310 took 23s (39.60% Gen, 56.42% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 30m 33s. Estimated total time: 19h 24m 45s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 49s, 500 more iterations: 3h 14m 7s.
[2025-11-13 09:58:01,210][__main__][INFO] - Starting iteration 310.
[2025-11-13 09:58:01,213][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:58:01,213][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:58:09,798][__main__][INFO] - Number of regex retries in iteration 310: 0
[2025-11-13 09:58:09,798][__main__][INFO] - agents played in iteration 310 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:58:10,241][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:10,274][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:10,306][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:10,339][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:10,340][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:58:10,341][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:58:11,036][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:58:11,331][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:58:11,660][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:58:11,987][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:58:12,319][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:58:12,649][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:58:12,977][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:58:13,303][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:58:13,634][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:58:13,960][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:58:14,286][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:58:14,613][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:58:14,941][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:58:15,270][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:58:15,596][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:58:15,927][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:58:16,254][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:58:16,581][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:58:16,912][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:58:17,240][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:58:17,567][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:58:17,897][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:58:18,225][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:58:18,553][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:58:18,884][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:58:19,215][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:58:19,541][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:58:19,870][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:58:20,196][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:58:20,522][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:58:20,848][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:58:21,174][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:58:21,500][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:58:22,219][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:58:22,922][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:58:22,924][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:58:22,925][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:58:24,701][__main__][INFO] - Iteration 311 took 23s (36.55% Gen, 55.89% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 39m 52s. Estimated total time: 19h 34m 28s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 8s, 500 more iterations: 3h 15m 44s.
[2025-11-13 09:58:24,704][__main__][INFO] - Starting iteration 311.
[2025-11-13 09:58:24,706][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 09:58:24,707][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:58:34,145][__main__][INFO] - Number of regex retries in iteration 311: 0
[2025-11-13 09:58:34,146][__main__][INFO] - agents played in iteration 311 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:58:34,587][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:34,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:34,655][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:34,687][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:34,688][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:58:34,688][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:58:35,386][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:58:35,683][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:58:36,011][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:58:36,338][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:58:36,668][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:58:36,996][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:58:37,326][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:58:37,653][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:58:37,979][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:58:38,307][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:58:38,634][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:58:38,961][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:58:39,289][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:58:39,617][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:58:39,943][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:58:40,271][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:58:40,599][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:58:40,927][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:58:41,255][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:58:41,585][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:58:41,912][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:58:42,239][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:58:42,566][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:58:42,893][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:58:43,218][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:58:43,545][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:58:43,873][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:58:44,201][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:58:44,528][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:58:44,854][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:58:45,180][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:58:45,508][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:58:45,836][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:58:46,573][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:58:47,281][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:58:47,283][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:58:47,285][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:58:48,210][__main__][INFO] - Iteration 312 took 23s (40.15% Gen, 55.90% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 40m 15s. Estimated total time: 19h 35m 14s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 10s, 500 more iterations: 3h 15m 52s.
[2025-11-13 09:58:48,213][__main__][INFO] - Starting iteration 312.
[2025-11-13 09:58:48,215][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 09:58:48,216][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:58:57,170][__main__][INFO] - Number of regex retries in iteration 312: 0
[2025-11-13 09:58:57,171][__main__][INFO] - agents played in iteration 312 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:58:57,636][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:57,683][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:57,719][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:57,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:57,753][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:58:57,754][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:58:58,476][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:58:58,772][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:58:59,099][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:58:59,424][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:58:59,750][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:59:00,076][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:59:00,401][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:59:00,733][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:59:01,065][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:59:01,399][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:59:01,731][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:59:02,058][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:59:02,384][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:59:02,714][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:59:03,043][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:59:03,370][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:59:03,698][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:59:04,024][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:59:04,350][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:59:04,677][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:59:05,006][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:59:05,333][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:59:05,660][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:59:05,987][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:59:06,313][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:59:06,641][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:59:06,969][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:59:07,296][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:59:07,623][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:59:07,949][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:59:08,276][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:59:08,602][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:59:08,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:59:09,649][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:59:10,354][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:59:10,355][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:59:10,357][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:59:11,248][__main__][INFO] - Iteration 313 took 23s (38.88% Gen, 57.25% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 16m 18s. Estimated total time: 19h 11m 40s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 23s, 500 more iterations: 3h 11m 56s.
[2025-11-13 09:59:11,250][__main__][INFO] - Starting iteration 313.
[2025-11-13 09:59:11,252][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 09:59:11,253][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:59:20,986][__main__][INFO] - Number of regex retries in iteration 313: 0
[2025-11-13 09:59:20,986][__main__][INFO] - agents played in iteration 313 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:59:21,445][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:21,478][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:21,511][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:21,545][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:21,545][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:59:21,546][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:59:22,254][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:59:22,551][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:59:22,880][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:59:23,206][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:59:23,532][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:59:23,860][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:59:24,186][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:59:24,511][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:59:24,839][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:59:25,166][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:59:25,493][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:59:25,826][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:59:26,155][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:59:26,482][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:59:26,808][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:59:27,135][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:59:27,466][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:59:27,794][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:59:28,123][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:59:28,452][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:59:28,779][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:59:29,105][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:59:29,433][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:59:29,760][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:59:30,089][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:59:30,415][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:59:30,741][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:59:31,067][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:59:31,394][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:59:31,726][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:59:32,053][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:59:32,378][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:59:32,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:59:33,434][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:59:34,153][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:59:34,154][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:59:34,156][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:59:35,068][__main__][INFO] - Iteration 314 took 23s (40.87% Gen, 55.30% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 55m 3s. Estimated total time: 19h 50m 49s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 41s, 500 more iterations: 3h 18m 28s.
[2025-11-13 09:59:35,070][__main__][INFO] - Starting iteration 314.
[2025-11-13 09:59:35,073][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 09:59:35,074][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:59:44,574][__main__][INFO] - Number of regex retries in iteration 314: 0
[2025-11-13 09:59:44,575][__main__][INFO] - agents played in iteration 314 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 09:59:45,029][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:45,065][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:45,098][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:45,131][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:45,132][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:59:45,132][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:59:45,858][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:59:46,155][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:59:46,480][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:59:46,806][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:59:47,132][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:59:47,459][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:59:47,787][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:59:48,117][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:59:48,443][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:59:48,771][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:59:49,098][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:59:49,426][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:59:49,755][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:59:50,081][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:59:50,409][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:59:50,737][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:59:51,064][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:59:51,395][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:59:51,722][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:59:52,050][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:59:52,377][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:59:52,703][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:59:53,031][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:59:53,358][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:59:53,685][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:59:54,012][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:59:54,340][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:59:54,668][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:59:54,994][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:59:55,320][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:59:55,646][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:59:55,973][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:59:56,300][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:59:57,049][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:59:57,825][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:59:57,827][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:59:57,830][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:59:58,796][__main__][INFO] - Iteration 315 took 23s (40.05% Gen, 55.88% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 50m 0s. Estimated total time: 19h 46m 10s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 32s, 500 more iterations: 3h 17m 41s.
[2025-11-13 09:59:58,798][__main__][INFO] - Starting iteration 315.
[2025-11-13 09:59:58,802][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 09:59:58,802][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:00:08,656][__main__][INFO] - Number of regex retries in iteration 315: 0
[2025-11-13 10:00:08,657][__main__][INFO] - agents played in iteration 315 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:00:09,111][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:09,145][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:09,179][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:09,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:09,214][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:00:09,214][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:00:09,904][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:00:10,200][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:00:10,527][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:00:10,854][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:00:11,183][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:00:11,510][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:00:11,836][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:00:12,164][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:00:12,493][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:00:12,818][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:00:13,145][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:00:13,471][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:00:13,797][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:00:14,123][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:00:14,451][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:00:14,780][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:00:15,107][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:00:15,433][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:00:15,760][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:00:16,089][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:00:16,415][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:00:16,741][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:00:17,068][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:00:17,395][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:00:17,722][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:00:18,048][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:00:18,374][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:00:18,701][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:00:19,029][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:00:19,356][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:00:19,688][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:00:20,016][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:00:20,343][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:00:21,064][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:00:21,782][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:00:21,784][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:00:21,786][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:00:22,777][__main__][INFO] - Iteration 316 took 23s (41.10% Gen, 54.76% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 2m 14s. Estimated total time: 19h 58m 48s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 57s, 500 more iterations: 3h 19m 48s.
[2025-11-13 10:00:22,779][__main__][INFO] - Starting iteration 316.
[2025-11-13 10:00:22,781][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 10:00:22,782][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:00:31,451][__main__][INFO] - Number of regex retries in iteration 316: 0
[2025-11-13 10:00:31,452][__main__][INFO] - agents played in iteration 316 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:00:31,902][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:31,935][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:31,967][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:32,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:32,000][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:00:32,001][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:00:33,054][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:00:33,351][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:00:33,679][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:00:34,007][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:00:34,333][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:00:34,659][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:00:34,985][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:00:35,310][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:00:35,637][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:00:35,963][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:00:36,290][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:00:36,621][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:00:36,949][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:00:37,275][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:00:37,602][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:00:37,929][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:00:38,258][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:00:38,585][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:00:38,912][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:00:39,240][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:00:39,566][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:00:39,893][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:00:40,222][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:00:40,549][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:00:40,875][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:00:41,202][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:00:41,530][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:00:41,859][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:00:42,185][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:00:42,513][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:00:42,841][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:00:43,168][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:00:43,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:00:44,220][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:00:44,943][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:00:44,945][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:00:44,947][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:00:45,927][__main__][INFO] - Iteration 317 took 23s (37.45% Gen, 58.30% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 20m 24s. Estimated total time: 19h 17m 20s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 34s, 500 more iterations: 3h 12m 53s.
[2025-11-13 10:00:45,930][__main__][INFO] - Starting iteration 317.
[2025-11-13 10:00:45,932][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 10:00:45,933][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:00:54,701][__main__][INFO] - Number of regex retries in iteration 317: 0
[2025-11-13 10:00:54,702][__main__][INFO] - agents played in iteration 317 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:00:55,164][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:55,200][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:55,234][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:55,268][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:55,268][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:00:55,269][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:00:55,995][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:00:56,291][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:00:56,617][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:00:56,944][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:00:57,270][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:00:57,597][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:00:57,926][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:00:58,257][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:00:58,591][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:00:58,923][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:00:59,250][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:00:59,577][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:00:59,904][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:01:00,231][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:01:00,557][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:01:00,885][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:01:01,211][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:01:01,537][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:01:01,863][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:01:02,191][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:01:02,517][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:01:02,844][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:01:03,172][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:01:03,499][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:01:03,825][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:01:04,152][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:01:04,482][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:01:04,807][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:01:05,134][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:01:05,463][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:01:05,791][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:01:06,117][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:01:06,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:01:07,169][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:01:07,898][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:01:07,899][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:01:07,901][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:01:08,871][__main__][INFO] - Iteration 318 took 22s (38.22% Gen, 57.54% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 9m 39s. Estimated total time: 19h 6m 59s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 13s, 500 more iterations: 3h 11m 9s.
[2025-11-13 10:01:08,873][__main__][INFO] - Starting iteration 318.
[2025-11-13 10:01:08,877][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 10:01:08,877][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:01:18,074][__main__][INFO] - Number of regex retries in iteration 318: 0
[2025-11-13 10:01:18,075][__main__][INFO] - agents played in iteration 318 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:01:18,515][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:18,548][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:18,581][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:18,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:18,616][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:01:18,616][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:01:19,342][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:01:19,640][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:01:19,969][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:01:20,295][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:01:20,625][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:01:20,952][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:01:21,279][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:01:21,605][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:01:21,932][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:01:22,263][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:01:22,592][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:01:22,919][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:01:23,246][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:01:23,572][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:01:23,901][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:01:24,228][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:01:24,555][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:01:24,882][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:01:25,209][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:01:25,536][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:01:25,863][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:01:26,191][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:01:26,517][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:01:26,843][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:01:27,170][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:01:27,496][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:01:27,825][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:01:28,152][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:01:28,478][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:01:28,806][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:01:29,133][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:01:29,460][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:01:29,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:01:30,504][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:01:31,240][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:01:31,242][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:01:31,244][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:01:32,217][__main__][INFO] - Iteration 319 took 23s (39.40% Gen, 56.42% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 29m 22s. Estimated total time: 19h 27m 5s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 54s, 500 more iterations: 3h 14m 30s.
[2025-11-13 10:01:32,219][__main__][INFO] - Starting iteration 319.
[2025-11-13 10:01:32,223][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 10:01:32,223][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:01:41,737][__main__][INFO] - Number of regex retries in iteration 319: 0
[2025-11-13 10:01:41,738][__main__][INFO] - agents played in iteration 319 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:01:42,188][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:42,222][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:42,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:42,290][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:42,290][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:01:42,290][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:01:43,024][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:01:43,322][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:01:43,649][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:01:43,976][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:01:44,303][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:01:44,631][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:01:44,961][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:01:45,287][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:01:45,612][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:01:45,940][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:01:46,266][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:01:46,594][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:01:46,920][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:01:47,247][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:01:47,575][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:01:47,901][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:01:48,229][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:01:48,556][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:01:48,883][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:01:49,209][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:01:49,535][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:01:49,864][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:01:50,192][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:01:50,518][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:01:50,846][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:01:51,172][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:01:51,498][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:01:51,826][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:01:52,154][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:01:52,481][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:01:52,807][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:01:53,136][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:01:53,465][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:01:54,197][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:01:54,934][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:01:54,935][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:01:54,937][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:01:55,906][__main__][INFO] - Iteration 320 took 23s (40.17% Gen, 55.73% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 46m 6s. Estimated total time: 19h 44m 12s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 28s, 500 more iterations: 3h 17m 22s.
[2025-11-13 10:01:55,909][__main__][INFO] - Starting iteration 320.
[2025-11-13 10:01:55,912][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 10:01:55,912][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:02:05,381][__main__][INFO] - Number of regex retries in iteration 320: 0
[2025-11-13 10:02:05,381][__main__][INFO] - agents played in iteration 320 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:02:05,854][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:05,887][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:05,919][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:05,953][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:05,953][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:02:05,954][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:02:06,690][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:02:06,988][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:02:07,316][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:02:07,647][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:02:07,977][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:02:08,304][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:02:08,632][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:02:08,964][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:02:09,298][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:02:09,626][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:02:09,953][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:02:10,279][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:02:10,607][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:02:10,934][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:02:11,262][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:02:11,589][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:02:11,916][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:02:12,244][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:02:12,571][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:02:12,898][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:02:13,225][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:02:13,551][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:02:13,878][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:02:14,205][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:02:14,533][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:02:14,859][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:02:15,186][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:02:15,513][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:02:15,840][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:02:16,167][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:02:16,494][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:02:16,820][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:02:17,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:02:17,857][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:02:18,591][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:02:18,592][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:02:18,594][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:02:20,472][__main__][INFO] - Iteration 321 took 24s (38.55% Gen, 53.80% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 29m 30s. Estimated total time: 20h 28m 1s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 56s, 500 more iterations: 3h 24m 40s.
[2025-11-13 10:02:20,474][__main__][INFO] - Starting iteration 321.
[2025-11-13 10:02:20,477][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:02:20,478][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:02:30,065][__main__][INFO] - Number of regex retries in iteration 321: 0
[2025-11-13 10:02:30,066][__main__][INFO] - agents played in iteration 321 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:02:30,525][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:30,561][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:30,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:30,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:30,629][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:02:30,629][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:02:31,347][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:02:31,645][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:02:31,971][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:02:32,298][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:02:32,624][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:02:32,952][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:02:33,279][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:02:33,606][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:02:33,933][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:02:34,259][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:02:34,586][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:02:34,914][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:02:35,239][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:02:35,565][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:02:35,891][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:02:36,218][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:02:36,548][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:02:36,874][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:02:37,200][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:02:37,525][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:02:37,854][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:02:38,181][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:02:38,508][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:02:38,836][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:02:39,162][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:02:39,489][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:02:39,817][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:02:40,144][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:02:40,471][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:02:40,798][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:02:41,126][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:02:41,451][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:02:41,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:02:42,511][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:02:43,240][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:02:43,241][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:02:43,243][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:02:44,184][__main__][INFO] - Iteration 322 took 23s (40.44% Gen, 55.58% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 46m 28s. Estimated total time: 19h 45m 23s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 30s, 500 more iterations: 3h 17m 33s.
[2025-11-13 10:02:44,186][__main__][INFO] - Starting iteration 322.
[2025-11-13 10:02:44,189][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:02:44,190][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:02:53,202][__main__][INFO] - Number of regex retries in iteration 322: 0
[2025-11-13 10:02:53,202][__main__][INFO] - agents played in iteration 322 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:02:53,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:54,026][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:54,060][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:54,094][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:54,095][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:02:54,095][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:02:54,825][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:02:55,122][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:02:55,450][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:02:55,776][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:02:56,101][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:02:56,426][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:02:56,753][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:02:57,081][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:02:57,408][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:02:57,735][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:02:58,062][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:02:58,392][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:02:58,723][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:02:59,049][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:02:59,375][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:02:59,702][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:03:00,028][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:03:00,355][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:03:00,680][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:03:01,008][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:03:01,334][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:03:01,660][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:03:01,988][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:03:02,314][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:03:02,640][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:03:02,966][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:03:03,296][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:03:03,621][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:03:03,948][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:03:04,275][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:03:04,600][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:03:04,926][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:03:05,252][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:03:05,964][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:03:06,687][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:03:06,688][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:03:06,690][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:03:07,640][__main__][INFO] - Iteration 323 took 23s (38.43% Gen, 57.51% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 33m 16s. Estimated total time: 19h 32m 35s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 5s, 500 more iterations: 3h 15m 25s.
[2025-11-13 10:03:07,643][__main__][INFO] - Starting iteration 323.
[2025-11-13 10:03:07,646][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:03:07,646][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:03:16,984][__main__][INFO] - Number of regex retries in iteration 323: 0
[2025-11-13 10:03:16,984][__main__][INFO] - agents played in iteration 323 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:03:17,439][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:17,472][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:17,505][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:17,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:17,540][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:03:17,541][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:03:18,271][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:03:18,567][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:03:18,897][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:03:19,222][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:03:19,553][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:03:19,879][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:03:20,208][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:03:20,534][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:03:20,862][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:03:21,189][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:03:21,516][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:03:21,842][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:03:22,167][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:03:22,495][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:03:22,822][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:03:23,149][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:03:23,474][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:03:23,800][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:03:24,127][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:03:24,453][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:03:24,780][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:03:25,105][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:03:25,432][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:03:25,759][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:03:26,086][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:03:26,413][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:03:26,739][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:03:27,067][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:03:27,392][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:03:27,718][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:03:28,044][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:03:28,370][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:03:28,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:03:29,427][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:03:30,168][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:03:30,170][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:03:30,171][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:03:31,146][__main__][INFO] - Iteration 324 took 23s (39.73% Gen, 56.11% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 35m 22s. Estimated total time: 19h 35m 4s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 10s, 500 more iterations: 3h 15m 50s.
[2025-11-13 10:03:31,149][__main__][INFO] - Starting iteration 324.
[2025-11-13 10:03:31,155][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:03:31,156][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:03:40,703][__main__][INFO] - Number of regex retries in iteration 324: 0 [2025-11-13 10:03:40,704][__main__][INFO] - agents played in iteration 324 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:03:41,167][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:03:41,200][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:03:41,234][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:03:41,269][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:03:41,269][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:03:41,270][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:03:42,003][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:03:42,301][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:03:42,627][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:03:42,954][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:03:43,281][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:03:43,609][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:03:43,935][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:03:44,263][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:03:44,589][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:03:44,915][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:03:45,241][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:03:45,567][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:03:45,895][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:03:46,221][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:03:46,549][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:03:46,875][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:03:47,202][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:03:47,529][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:03:47,855][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:03:48,182][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:03:48,509][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:03:48,836][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:03:49,165][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:03:49,495][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:03:49,821][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:03:50,147][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:03:50,475][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:03:50,801][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:03:51,127][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:03:51,453][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:03:51,779][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:03:52,105][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:03:52,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:03:53,162][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:03:53,895][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:03:53,896][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:03:53,898][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:03:54,847][__main__][INFO] - Iteration 325 took 23s (40.30% Gen, 55.69% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 44m 32s. Estimated total time: 19h 44m 38s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 29s, 500 more iterations: 3h 17m 26s.
[2025-11-13 10:03:54,850][__main__][INFO] - Starting iteration 325.
[2025-11-13 10:03:54,853][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:03:54,854][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:04:04,565][__main__][INFO] - Number of regex retries in iteration 325: 0
[2025-11-13 10:04:04,565][__main__][INFO] - agents played in iteration 325 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:04:05,023][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:05,056][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:05,089][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:05,123][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:05,124][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:04:05,124][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:04:05,859][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:04:06,157][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:04:06,485][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:04:06,811][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:04:07,137][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:04:07,464][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:04:07,791][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:04:08,121][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:04:08,450][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:04:08,775][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:04:09,103][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:04:09,429][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:04:09,758][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:04:10,086][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:04:10,413][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:04:10,740][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:04:11,069][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:04:11,395][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:04:11,721][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:04:12,049][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:04:12,374][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:04:12,704][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:04:13,030][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:04:13,358][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:04:13,685][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:04:14,009][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:04:14,339][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:04:14,664][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:04:14,990][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:04:15,317][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:04:15,645][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:04:15,974][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:04:16,300][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:04:17,021][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:04:17,751][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:04:17,752][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:04:17,754][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:04:18,663][__main__][INFO] - Iteration 326 took 23s (40.78% Gen, 55.39% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 50m 2s. Estimated total time: 19h 50m 32s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 41s, 500 more iterations: 3h 18m 25s.
[2025-11-13 10:04:18,665][__main__][INFO] - Starting iteration 326.
[2025-11-13 10:04:18,668][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:04:18,668][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:04:27,534][__main__][INFO] - Number of regex retries in iteration 326: 0
[2025-11-13 10:04:27,535][__main__][INFO] - agents played in iteration 326 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:04:27,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:28,403][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:28,436][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:28,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:28,470][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:04:28,471][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:04:29,239][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:04:29,537][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:04:29,865][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:04:30,192][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:04:30,519][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:04:30,847][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:04:31,175][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:04:31,501][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:04:31,830][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:04:32,158][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:04:32,485][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:04:32,811][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:04:33,137][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:04:33,464][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:04:33,790][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:04:34,116][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:04:34,442][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:04:34,769][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:04:35,099][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:04:35,428][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:04:35,754][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:04:36,081][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:04:36,409][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:04:36,735][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:04:37,061][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:04:37,388][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:04:37,714][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:04:38,041][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:04:38,368][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:04:38,696][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:04:39,025][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:04:39,356][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:04:39,683][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:04:40,398][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:04:41,129][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:04:41,130][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:04:41,132][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:04:42,028][__main__][INFO] - Iteration 327 took 23s (37.95% Gen, 58.21% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 27m 10s. Estimated total time: 19h 28m 3s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 56s, 500 more iterations: 3h 14m 40s.
[2025-11-13 10:04:42,030][__main__][INFO] - Starting iteration 327.
[2025-11-13 10:04:42,033][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:04:42,033][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:04:51,896][__main__][INFO] - Number of regex retries in iteration 327: 0
[2025-11-13 10:04:51,897][__main__][INFO] - agents played in iteration 327 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:04:52,351][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:52,385][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:52,418][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:52,452][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:52,452][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:04:52,453][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:04:53,208][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:04:53,506][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:04:53,833][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:04:54,166][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:04:54,494][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:04:54,822][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:04:55,147][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:04:55,474][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:04:55,802][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:04:56,129][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:04:56,459][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:04:56,786][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:04:57,113][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:04:57,438][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:04:57,764][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:04:58,091][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:04:58,418][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:04:58,748][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:04:59,075][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:04:59,401][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:04:59,728][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:05:00,054][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:05:00,380][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:05:00,708][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:05:01,040][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:05:01,366][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:05:01,692][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:05:02,017][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:05:02,344][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:05:02,676][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:05:03,001][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:05:03,329][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:05:03,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:05:04,397][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:05:05,116][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:05:05,117][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:05:05,119][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:05:06,015][__main__][INFO] - Iteration 328 took 23s (41.13% Gen, 55.13% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 57m 52s. Estimated total time: 19h 59m 8s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 58s, 500 more iterations: 3h 19m 51s.
[2025-11-13 10:05:06,017][__main__][INFO] - Starting iteration 328.
[2025-11-13 10:05:06,020][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:05:06,020][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:05:14,704][__main__][INFO] - Number of regex retries in iteration 328: 0
[2025-11-13 10:05:14,705][__main__][INFO] - agents played in iteration 328 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:05:15,158][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:05:15,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:05:15,225][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:05:15,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:05:15,260][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:05:15,261][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:05:16,000][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:05:16,297][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:05:16,625][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:05:16,951][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:05:17,280][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:05:17,606][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:05:17,932][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:05:18,261][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:05:18,586][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:05:18,913][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:05:19,239][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:05:19,567][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:05:19,896][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:05:20,225][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:05:20,552][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:05:20,878][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:05:21,206][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:05:21,532][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:05:21,858][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:05:22,187][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:05:22,515][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:05:22,841][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:05:23,169][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:05:23,498][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:05:23,826][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:05:24,152][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:05:24,477][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:05:24,803][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:05:25,135][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:05:25,465][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:05:25,797][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:05:26,126][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:05:26,454][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:05:27,221][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:05:27,944][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:05:27,945][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:05:27,947][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:05:28,859][__main__][INFO] - Iteration 329 took 22s (38.02% Gen, 57.98% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 0m 20s. Estimated total time: 19h 1m 59s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 3s, 500 more iterations: 3h 10m 19s.
[2025-11-13 10:05:28,861][__main__][INFO] - Starting iteration 329.
[2025-11-13 10:05:28,864][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:05:28,864][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:05:37,631][__main__][INFO] - Number of regex retries in iteration 329: 0
[2025-11-13 10:05:37,631][__main__][INFO] - agents played in iteration 329 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:05:38,101][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:05:38,134][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:05:38,166][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:05:38,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:05:38,200][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:05:38,200][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:05:38,949][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:05:39,247][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:05:39,577][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:05:39,904][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:05:40,232][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:05:40,561][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:05:40,891][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:05:41,223][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:05:41,550][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:05:41,878][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:05:42,206][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:05:42,531][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:05:42,859][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:05:43,185][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:05:43,514][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:05:43,840][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:05:44,167][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:05:44,495][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:05:44,822][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:05:45,150][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:05:45,477][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:05:45,804][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:05:46,131][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:05:46,461][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:05:46,789][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:05:47,116][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:05:47,445][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:05:47,772][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:05:48,098][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:05:48,423][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:05:48,749][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:05:49,079][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:05:49,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:05:50,140][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:05:50,864][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:05:50,865][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:05:50,867][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:05:51,814][__main__][INFO] - Iteration 330 took 22s (38.19% Gen, 57.67% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 5m 32s. Estimated total time: 19h 7m 35s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 15s, 500 more iterations: 3h 11m 15s.
[2025-11-13 10:05:51,817][__main__][INFO] - Starting iteration 330.
[2025-11-13 10:05:51,820][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:05:51,821][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:06:01,240][__main__][INFO] - Number of regex retries in iteration 330: 0
[2025-11-13 10:06:01,241][__main__][INFO] - agents played in iteration 330 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:06:01,702][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:01,735][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:01,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:01,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:01,803][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:06:01,803][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:06:02,554][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:06:02,852][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:06:03,181][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:06:03,509][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:06:03,836][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:06:04,162][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:06:04,489][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:06:04,815][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:06:05,141][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:06:05,468][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:06:05,794][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:06:06,126][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:06:06,453][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:06:06,781][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:06:07,109][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:06:07,439][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:06:07,765][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:06:08,092][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:06:08,421][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:06:08,752][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:06:09,080][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:06:09,409][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:06:09,737][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:06:10,063][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:06:10,390][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:06:10,722][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:06:11,048][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:06:11,381][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:06:11,711][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:06:12,037][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:06:12,364][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:06:12,688][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:06:13,017][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:06:13,745][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:06:14,468][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:06:14,470][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:06:14,473][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:06:16,565][__main__][INFO] - Iteration 331 took 24s (38.07% Gen, 53.47% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 34m 48s. Estimated total time: 20h 37m 15s. Time estimates for 10 more iterations: 4m 7s, 100 more iterations: 41m 14s, 500 more iterations: 3h 26m 12s.
[2025-11-13 10:06:16,567][__main__][INFO] - Starting iteration 331.
[2025-11-13 10:06:16,570][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:06:16,570][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:06:25,934][__main__][INFO] - Number of regex retries in iteration 331: 0
[2025-11-13 10:06:25,935][__main__][INFO] - agents played in iteration 331 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:06:26,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:26,432][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:26,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:26,500][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:26,501][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:06:26,501][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:06:27,242][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:06:27,539][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:06:27,866][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:06:28,193][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:06:28,521][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:06:28,850][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:06:29,180][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:06:29,513][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:06:29,839][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:06:30,166][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:06:30,494][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:06:30,820][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:06:31,148][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:06:31,474][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:06:31,799][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:06:32,126][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:06:32,454][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:06:32,783][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:06:33,111][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:06:33,438][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:06:33,765][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:06:34,095][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:06:34,420][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:06:34,750][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:06:35,076][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:06:35,402][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:06:35,728][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:06:36,056][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:06:36,387][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:06:36,720][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:06:37,047][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:06:37,375][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:06:37,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:06:38,429][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:06:39,166][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:06:39,167][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:06:39,169][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:06:40,085][__main__][INFO] - Iteration 332 took 23s (39.82% Gen, 56.28% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 32m 58s. Estimated total time: 19h 35m 49s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 11s, 500 more iterations: 3h 15m 58s.
[2025-11-13 10:06:40,087][__main__][INFO] - Starting iteration 332.
[2025-11-13 10:06:40,090][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:06:40,090][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:06:49,426][__main__][INFO] - Number of regex retries in iteration 332: 0
[2025-11-13 10:06:49,427][__main__][INFO] - agents played in iteration 332 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:06:49,898][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:49,931][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:49,965][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:50,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:50,000][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:06:50,001][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:06:50,742][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:06:51,039][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:06:51,367][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:06:51,697][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:06:52,025][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:06:52,352][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:06:52,684][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:06:53,010][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:06:53,341][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:06:53,667][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:06:53,994][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:06:54,320][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:06:54,650][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:06:54,981][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:06:55,308][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:06:55,634][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:06:55,961][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:06:56,289][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:06:56,616][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:06:56,943][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:06:57,271][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:06:57,598][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:06:57,924][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:06:58,254][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:06:58,581][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:06:58,909][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:06:59,235][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:06:59,564][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:06:59,893][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:07:00,218][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:07:00,543][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:07:00,869][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:07:01,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:07:01,918][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:07:02,650][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:07:02,651][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:07:02,653][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:07:03,589][__main__][INFO] - Iteration 333 took 23s (39.73% Gen, 56.28% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 31m 46s. Estimated total time: 19h 35m 0s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 10s, 500 more iterations: 3h 15m 50s.
[2025-11-13 10:07:03,591][__main__][INFO] - Starting iteration 333.
[2025-11-13 10:07:03,594][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:07:03,595][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:07:12,957][__main__][INFO] - Number of regex retries in iteration 333: 0
[2025-11-13 10:07:12,958][__main__][INFO] - agents played in iteration 333 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:07:13,422][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:13,456][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:13,490][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:13,524][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:13,524][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:07:13,525][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:07:14,278][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:07:14,575][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:07:14,903][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:07:15,232][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:07:15,558][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:07:15,885][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:07:16,212][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:07:16,538][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:07:16,863][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:07:17,190][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:07:17,517][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:07:17,846][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:07:18,175][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:07:18,502][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:07:18,829][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:07:19,157][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:07:19,485][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:07:19,812][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:07:20,139][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:07:20,466][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:07:20,794][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:07:21,120][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:07:21,446][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:07:21,771][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:07:22,097][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:07:22,423][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:07:22,750][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:07:23,078][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:07:23,403][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:07:23,729][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:07:24,055][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:07:24,382][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:07:24,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:07:25,438][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:07:26,154][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:07:26,156][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:07:26,157][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:07:27,084][__main__][INFO] - Iteration 334 took 23s (39.86% Gen, 56.19% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 30m 55s. Estimated total time: 19h 34m 33s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 9s, 500 more iterations: 3h 15m 45s.
[2025-11-13 10:07:27,087][__main__][INFO] - Starting iteration 334.
[2025-11-13 10:07:27,090][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:07:27,090][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:07:36,008][__main__][INFO] - Number of regex retries in iteration 334: 0
[2025-11-13 10:07:36,008][__main__][INFO] - agents played in iteration 334 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:07:36,468][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:36,503][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:36,537][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:36,571][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:36,572][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:07:36,572][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:07:37,313][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:07:37,611][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:07:37,939][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:07:38,272][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:07:38,603][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:07:38,929][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:07:39,255][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:07:39,582][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:07:39,913][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:07:40,239][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:07:40,567][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:07:40,892][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:07:41,220][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:07:41,547][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:07:41,874][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:07:42,200][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:07:42,529][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:07:42,856][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:07:43,185][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:07:43,510][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:07:43,839][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:07:44,165][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:07:44,494][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:07:44,821][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:07:45,147][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:07:45,476][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:07:45,802][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:07:46,131][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:07:46,457][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:07:46,785][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:07:47,111][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:07:47,437][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:07:47,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:07:48,479][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:07:49,214][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:07:49,215][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:07:49,217][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:07:50,344][__main__][INFO] - Iteration 335 took 23s (38.34% Gen, 56.80% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 18m 45s. Estimated total time: 19h 22m 46s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 45s, 500 more iterations: 3h 13m 47s.
[2025-11-13 10:07:50,346][__main__][INFO] - Starting iteration 335.
[2025-11-13 10:07:50,349][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:07:50,350][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:07:59,310][__main__][INFO] - Number of regex retries in iteration 335: 0 [2025-11-13 10:07:59,310][__main__][INFO] - agents played in iteration 335 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:07:59,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:08:00,153][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:08:00,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:08:00,220][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:08:00,221][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:08:00,222][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:08:00,965][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:08:01,263][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:08:01,591][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:08:01,918][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:08:02,246][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:08:02,573][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:08:02,899][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:08:03,226][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:08:03,553][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:08:03,878][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:08:04,208][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:08:04,536][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:08:04,863][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:08:05,191][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:08:05,517][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:08:05,844][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:08:06,170][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:08:06,498][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:08:06,830][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:08:07,155][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:08:07,481][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:08:07,808][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:08:08,134][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:08:08,460][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:08:08,788][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:08:09,114][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:08:09,440][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:08:09,766][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:08:10,094][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:08:10,420][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:08:10,747][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:08:11,075][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:08:11,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:08:12,139][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:08:12,872][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:08:12,873][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:08:12,875][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:08:13,834][__main__][INFO] - Iteration 336 took 23s (38.15% Gen, 57.76% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 29m 52s. Estimated total time: 19h 34m 16s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 8s, 500 more iterations: 3h 15m 42s.
[2025-11-13 10:08:13,837][__main__][INFO] - Starting iteration 336.
[2025-11-13 10:08:13,839][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:08:13,840][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:08:23,217][__main__][INFO] - Number of regex retries in iteration 336: 0
[2025-11-13 10:08:23,217][__main__][INFO] - agents played in iteration 336 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:08:23,682][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:23,716][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:23,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:23,783][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:23,784][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:08:23,784][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:08:24,512][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:08:24,810][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:08:25,137][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:08:25,463][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:08:25,790][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:08:26,118][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:08:26,445][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:08:26,772][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:08:27,100][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:08:27,428][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:08:27,755][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:08:28,083][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:08:28,411][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:08:28,738][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:08:29,068][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:08:29,396][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:08:29,722][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:08:30,052][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:08:30,378][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:08:30,705][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:08:31,030][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:08:31,357][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:08:31,685][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:08:32,012][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:08:32,341][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:08:32,674][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:08:33,006][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:08:33,334][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:08:33,664][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:08:33,990][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:08:34,317][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:08:34,650][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:08:34,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:08:35,706][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:08:36,451][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:08:36,452][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:08:36,454][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:08:37,429][__main__][INFO] - Iteration 337 took 23s (39.75% Gen, 56.11% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 34m 43s. Estimated total time: 19h 39m 32s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 19s, 500 more iterations: 3h 16m 35s.
[2025-11-13 10:08:37,431][__main__][INFO] - Starting iteration 337.
[2025-11-13 10:08:37,435][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:08:37,435][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:08:46,659][__main__][INFO] - Number of regex retries in iteration 337: 0
[2025-11-13 10:08:46,660][__main__][INFO] - agents played in iteration 337 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:08:47,125][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:47,159][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:47,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:47,227][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:47,228][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:08:47,228][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:08:47,968][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:08:48,265][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:08:48,592][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:08:48,918][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:08:49,245][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:08:49,571][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:08:49,898][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:08:50,224][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:08:50,550][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:08:50,880][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:08:51,207][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:08:51,534][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:08:51,863][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:08:52,190][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:08:52,517][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:08:52,843][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:08:53,170][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:08:53,496][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:08:53,822][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:08:54,149][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:08:54,480][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:08:54,809][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:08:55,140][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:08:55,469][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:08:55,796][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:08:56,121][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:08:56,450][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:08:56,780][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:08:57,110][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:08:57,438][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:08:57,769][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:08:58,098][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:08:58,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:08:59,127][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:08:59,841][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:08:59,842][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:08:59,844][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:09:00,736][__main__][INFO] - Iteration 338 took 23s (39.59% Gen, 56.58% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 19m 55s. Estimated total time: 19h 25m 7s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 50s, 500 more iterations: 3h 14m 11s.
[2025-11-13 10:09:00,738][__main__][INFO] - Starting iteration 338.
[2025-11-13 10:09:00,741][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:09:00,741][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:09:10,450][__main__][INFO] - Number of regex retries in iteration 338: 0
[2025-11-13 10:09:10,451][__main__][INFO] - agents played in iteration 338 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:09:10,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:10,942][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:10,976][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:11,009][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:11,010][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:09:11,010][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:09:11,743][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:09:12,041][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:09:12,369][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:09:12,696][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:09:13,021][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:09:13,350][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:09:13,676][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:09:14,002][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:09:14,331][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:09:14,658][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:09:14,985][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:09:15,311][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:09:15,639][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:09:15,967][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:09:16,293][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:09:16,622][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:09:16,950][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:09:17,276][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:09:17,609][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:09:17,936][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:09:18,262][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:09:18,590][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:09:18,918][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:09:19,248][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:09:19,576][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:09:19,905][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:09:20,234][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:09:20,561][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:09:20,890][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:09:21,216][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:09:21,548][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:09:21,878][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:09:22,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:09:22,912][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:09:23,616][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:09:23,617][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:09:23,619][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:09:24,709][__main__][INFO] - Iteration 339 took 23s (40.51% Gen, 54.94% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 52m 52s. Estimated total time: 19h 58m 27s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 56s, 500 more iterations: 3h 19m 44s.
[2025-11-13 10:09:24,711][__main__][INFO] - Starting iteration 339.
[2025-11-13 10:09:24,714][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:09:24,714][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:09:33,705][__main__][INFO] - Number of regex retries in iteration 339: 0
[2025-11-13 10:09:33,706][__main__][INFO] - agents played in iteration 339 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:09:34,168][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:34,202][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:34,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:34,270][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:34,271][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:09:34,271][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:09:35,020][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:09:35,317][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:09:35,645][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:09:35,975][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:09:36,302][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:09:36,630][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:09:36,957][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:09:37,283][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:09:37,609][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:09:37,939][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:09:38,265][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:09:38,593][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:09:38,919][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:09:39,248][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:09:39,576][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:09:39,902][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:09:40,228][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:09:40,553][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:09:40,881][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:09:41,207][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:09:41,533][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:09:41,859][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:09:42,188][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:09:42,516][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:09:42,844][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:09:43,176][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:09:43,507][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:09:43,837][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:09:44,162][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:09:44,489][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:09:44,817][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:09:45,144][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:09:45,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:09:46,179][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:09:46,886][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:09:46,887][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:09:46,889][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:09:47,884][__main__][INFO] - Iteration 340 took 23s (38.80% Gen, 56.89% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 12m 34s. Estimated total time: 19h 18m 33s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 37s, 500 more iterations: 3h 13m 5s.
[2025-11-13 10:09:47,886][__main__][INFO] - Starting iteration 340.
[2025-11-13 10:09:47,889][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:09:47,890][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:09:57,263][__main__][INFO] - Number of regex retries in iteration 340: 0
[2025-11-13 10:09:57,264][__main__][INFO] - agents played in iteration 340 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:09:57,723][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:57,756][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:57,790][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:57,824][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:57,824][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:09:57,825][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:09:58,565][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:09:58,862][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:09:59,189][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:09:59,515][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:09:59,840][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:10:00,166][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:10:00,494][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:10:00,819][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:10:01,147][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:10:01,474][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:10:01,800][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:10:02,126][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:10:02,458][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:10:02,788][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:10:03,114][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:10:03,441][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:10:03,771][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:10:04,099][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:10:04,431][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:10:04,756][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:10:05,082][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:10:05,410][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:10:05,739][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:10:06,068][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:10:06,394][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:10:06,721][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:10:07,049][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:10:07,373][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:10:07,699][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:10:08,025][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:10:08,351][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:10:08,677][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:10:09,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:10:09,683][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:10:10,387][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:10:10,388][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:10:10,390][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:10:12,271][__main__][INFO] - Iteration 341 took 24s (38.44% Gen, 53.84% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 12m 43s. Estimated total time: 20h 19m 6s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 38s, 500 more iterations: 3h 23m 11s.
[2025-11-13 10:10:12,274][__main__][INFO] - Starting iteration 341.
[2025-11-13 10:10:12,277][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:10:12,277][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:10:21,324][__main__][INFO] - Number of regex retries in iteration 341: 0
[2025-11-13 10:10:21,325][__main__][INFO] - agents played in iteration 341 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:10:21,782][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:21,814][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:21,848][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:21,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:21,883][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:10:21,883][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:10:22,621][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:10:22,919][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:10:23,248][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:10:23,574][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:10:23,900][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:10:24,227][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:10:24,553][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:10:24,880][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:10:25,207][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:10:25,534][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:10:25,861][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:10:26,189][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:10:26,522][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:10:26,850][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:10:27,178][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:10:27,509][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:10:27,836][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:10:28,162][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:10:28,488][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:10:28,814][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:10:29,146][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:10:29,478][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:10:29,804][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:10:30,130][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:10:30,458][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:10:30,784][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:10:31,110][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:10:31,435][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:10:31,761][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:10:32,089][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:10:32,415][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:10:32,740][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:10:33,068][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:10:33,785][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:10:34,499][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:10:34,501][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:10:34,502][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:10:35,466][__main__][INFO] - Iteration 342 took 23s (39.02% Gen, 56.82% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 12m 43s. Estimated total time: 19h 19m 29s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 38s, 500 more iterations: 3h 13m 14s.
[2025-11-13 10:10:35,468][__main__][INFO] - Starting iteration 342.
[2025-11-13 10:10:35,471][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:10:35,471][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:10:44,605][__main__][INFO] - Number of regex retries in iteration 342: 0
[2025-11-13 10:10:44,605][__main__][INFO] - agents played in iteration 342 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:10:45,066][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:45,099][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:45,133][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:45,167][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:45,167][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:10:45,168][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:10:46,256][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:10:46,554][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:10:46,880][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:10:47,207][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:10:47,533][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:10:47,861][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:10:48,188][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:10:48,515][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:10:48,843][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:10:49,171][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:10:49,500][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:10:49,831][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:10:50,161][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:10:50,489][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:10:50,816][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:10:51,140][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:10:51,467][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:10:51,794][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:10:52,126][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:10:52,453][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:10:52,783][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:10:53,109][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:10:53,439][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:10:53,766][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:10:54,093][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:10:54,418][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:10:54,745][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:10:55,071][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:10:55,398][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:10:55,725][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:10:56,053][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:10:56,381][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:10:56,711][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:10:57,435][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:10:58,155][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:10:58,157][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:10:58,159][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:10:59,120][__main__][INFO] - Iteration 343 took 23s (38.62% Gen, 57.31% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 35m 19s. Estimated total time: 19h 42m 29s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 24s, 500 more iterations: 3h 17m 4s.
[2025-11-13 10:10:59,122][__main__][INFO] - Starting iteration 343.
[2025-11-13 10:10:59,125][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:10:59,125][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:11:08,829][__main__][INFO] - Number of regex retries in iteration 343: 0
[2025-11-13 10:11:08,830][__main__][INFO] - agents played in iteration 343 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:11:09,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:09,320][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:09,354][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:09,388][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:09,388][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:11:09,389][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:11:10,133][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:11:10,430][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:11:10,759][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:11:11,086][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:11:11,413][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:11:11,739][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:11:12,066][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:11:12,394][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:11:12,723][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:11:13,050][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:11:13,376][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:11:13,702][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:11:14,030][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:11:14,359][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:11:14,688][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:11:15,020][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:11:15,345][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:11:15,673][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:11:16,001][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:11:16,328][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:11:16,656][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:11:16,980][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:11:17,307][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:11:17,635][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:11:17,961][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:11:18,287][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:11:18,614][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:11:18,939][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:11:19,267][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:11:19,594][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:11:19,919][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:11:20,245][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:11:20,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:11:21,276][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:11:22,008][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:11:22,010][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:11:22,012][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:11:22,978][__main__][INFO] - Iteration 344 took 23s (40.68% Gen, 55.26% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 45m 9s. Estimated total time: 19h 52m 43s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 45s, 500 more iterations: 3h 18m 47s.
[2025-11-13 10:11:22,980][__main__][INFO] - Starting iteration 344.
[2025-11-13 10:11:22,984][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:11:22,984][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:11:32,737][__main__][INFO] - Number of regex retries in iteration 344: 0
[2025-11-13 10:11:32,737][__main__][INFO] - agents played in iteration 344 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:11:33,204][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:33,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:33,271][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:33,305][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:33,305][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:11:33,306][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:11:34,039][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:11:34,336][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:11:34,665][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:11:34,992][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:11:35,323][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:11:35,649][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:11:35,978][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:11:36,303][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:11:36,630][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:11:36,955][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:11:37,283][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:11:37,611][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:11:37,938][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:11:38,266][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:11:38,592][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:11:38,920][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:11:39,247][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:11:39,574][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:11:39,902][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:11:40,230][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:11:40,563][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:11:40,889][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:11:41,216][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:11:41,540][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:11:41,865][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:11:42,193][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:11:42,519][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:11:42,847][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:11:43,175][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:11:43,503][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:11:43,831][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:11:44,159][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:11:44,487][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:11:45,199][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:11:45,981][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:11:45,983][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:11:45,984][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:11:46,886][__main__][INFO] - Iteration 345 took 23s (40.80% Gen, 55.42% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 47m 11s. Estimated total time: 19h 55m 9s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 50s, 500 more iterations: 3h 19m 11s.
[2025-11-13 10:11:46,888][__main__][INFO] - Starting iteration 345.
[2025-11-13 10:11:46,891][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:11:46,891][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:11:56,243][__main__][INFO] - Number of regex retries in iteration 345: 0
[2025-11-13 10:11:56,244][__main__][INFO] - agents played in iteration 345 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:11:56,706][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:56,739][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:56,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:56,807][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:56,807][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:11:56,808][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:11:57,538][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:11:57,835][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:11:58,162][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:11:58,487][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:11:58,816][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:11:59,142][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:11:59,470][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:11:59,802][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:12:00,134][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:12:00,459][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:12:00,789][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:12:01,114][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:12:01,445][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:12:01,771][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:12:02,098][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:12:02,424][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:12:02,751][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:12:03,079][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:12:03,407][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:12:03,738][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:12:04,066][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:12:04,397][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:12:04,721][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:12:05,047][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:12:05,375][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:12:05,699][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:12:06,026][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:12:06,356][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:12:06,684][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:12:07,009][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:12:07,336][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:12:07,664][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:12:07,991][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:12:08,698][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:12:09,413][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:12:09,415][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:12:09,417][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:12:10,353][__main__][INFO] - Iteration 346 took 23s (39.86% Gen, 56.14% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 24m 49s. Estimated total time: 19h 33m 10s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 6s, 500 more iterations: 3h 15m 31s.
[2025-11-13 10:12:10,356][__main__][INFO] - Starting iteration 346.
[2025-11-13 10:12:10,359][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:12:10,359][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:12:20,131][__main__][INFO] - Number of regex retries in iteration 346: 0 [2025-11-13 10:12:20,132][__main__][INFO] - agents played in iteration 346 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:12:20,602][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:12:20,635][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:12:20,669][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:12:20,702][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:12:20,703][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:12:20,703][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:12:21,431][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:12:21,727][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:12:22,055][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:12:22,382][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:12:22,708][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:12:23,036][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:12:23,363][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:12:23,689][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:12:24,017][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:12:24,344][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:12:24,672][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:12:24,999][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:12:25,327][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:12:25,655][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:12:25,984][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:12:26,312][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:12:26,640][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:12:26,967][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:12:27,293][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:12:27,619][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:12:27,948][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:12:28,275][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:12:28,603][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:12:28,931][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:12:29,256][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:12:29,589][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:12:29,920][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:12:30,247][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:12:30,574][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:12:30,902][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:12:31,226][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:12:31,553][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:12:31,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:12:32,593][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:12:33,316][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:12:33,318][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:12:33,319][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:12:34,245][__main__][INFO] - Iteration 347 took 23s (40.91% Gen, 55.21% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 45m 35s. Estimated total time: 19h 54m 20s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 48s, 500 more iterations: 3h 19m 3s.
[2025-11-13 10:12:34,247][__main__][INFO] - Starting iteration 347.
[2025-11-13 10:12:34,250][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:12:34,250][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:12:43,939][__main__][INFO] - Number of regex retries in iteration 347: 0
[2025-11-13 10:12:43,939][__main__][INFO] - agents played in iteration 347 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:12:44,389][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:44,422][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:44,455][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:44,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:44,490][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:12:44,490][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:12:45,196][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:12:45,493][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:12:45,819][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:12:46,144][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:12:46,470][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:12:46,797][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:12:47,124][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:12:47,458][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:12:47,785][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:12:48,111][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:12:48,437][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:12:48,766][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:12:49,095][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:12:49,426][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:12:49,756][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:12:50,081][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:12:50,406][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:12:50,733][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:12:51,059][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:12:51,387][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:12:51,712][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:12:52,042][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:12:52,369][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:12:52,696][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:12:53,022][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:12:53,351][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:12:53,676][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:12:54,001][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:12:54,329][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:12:54,655][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:12:54,981][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:12:55,308][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:12:55,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:12:56,337][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:12:57,034][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:12:57,035][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:12:57,037][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:12:57,910][__main__][INFO] - Iteration 348 took 23s (40.95% Gen, 55.35% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 33m 55s. Estimated total time: 19h 43m 3s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 26s, 500 more iterations: 3h 17m 10s.
[2025-11-13 10:12:57,912][__main__][INFO] - Starting iteration 348.
[2025-11-13 10:12:57,914][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:12:57,915][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:13:07,249][__main__][INFO] - Number of regex retries in iteration 348: 0
[2025-11-13 10:13:07,249][__main__][INFO] - agents played in iteration 348 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:13:07,703][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:07,736][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:07,769][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:07,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:07,803][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:13:07,804][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:13:08,521][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:13:08,816][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:13:09,144][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:13:09,477][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:13:09,805][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:13:10,131][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:13:10,458][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:13:10,785][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:13:11,112][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:13:11,440][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:13:11,770][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:13:12,097][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:13:12,425][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:13:12,751][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:13:13,084][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:13:13,418][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:13:13,752][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:13:14,078][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:13:14,405][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:13:14,732][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:13:15,057][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:13:15,386][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:13:15,713][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:13:16,038][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:13:16,363][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:13:16,693][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:13:17,020][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:13:17,348][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:13:17,677][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:13:18,005][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:13:18,332][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:13:18,657][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:13:18,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:13:19,700][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:13:20,401][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:13:20,402][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:13:20,404][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:13:21,283][__main__][INFO] - Iteration 349 took 23s (39.94% Gen, 56.29% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 18m 55s. Estimated total time: 19h 28m 27s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 56s, 500 more iterations: 3h 14m 44s.
[2025-11-13 10:13:21,285][__main__][INFO] - Starting iteration 349.
[2025-11-13 10:13:21,288][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:13:21,288][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:13:30,824][__main__][INFO] - Number of regex retries in iteration 349: 0
[2025-11-13 10:13:30,824][__main__][INFO] - agents played in iteration 349 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:13:31,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:31,328][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:31,371][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:31,404][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:31,405][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:13:31,406][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:13:32,128][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:13:32,426][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:13:32,754][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:13:33,081][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:13:33,413][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:13:33,741][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:13:34,069][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:13:34,396][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:13:34,724][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:13:35,051][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:13:35,377][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:13:35,703][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:13:36,032][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:13:36,360][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:13:36,688][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:13:37,014][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:13:37,341][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:13:37,668][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:13:37,994][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:13:38,319][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:13:38,644][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:13:38,972][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:13:39,297][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:13:39,625][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:13:39,950][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:13:40,276][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:13:40,603][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:13:40,929][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:13:41,256][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:13:41,585][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:13:41,910][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:13:42,237][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:13:42,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:13:43,268][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:13:43,991][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:13:43,992][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:13:43,994][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:13:44,969][__main__][INFO] - Iteration 350 took 23s (40.26% Gen, 55.61% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 34m 9s. Estimated total time: 19h 44m 5s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 28s, 500 more iterations: 3h 17m 20s.
[2025-11-13 10:13:44,971][__main__][INFO] - Starting iteration 350.
[2025-11-13 10:13:44,975][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:13:44,976][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:13:50,692][mllm.models.large_language_model_local][WARNING] - Response >A< did not match regex: (|), retry 1/1
[2025-11-13 10:13:53,908][__main__][INFO] - Number of regex retries in iteration 350: 1
[2025-11-13 10:13:53,908][__main__][INFO] - agents played in iteration 350 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:13:54,385][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:54,418][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:54,451][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:54,486][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:54,486][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:13:54,486][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:13:55,249][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:13:55,547][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:13:55,874][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:13:56,200][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:13:56,527][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:13:56,854][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:13:57,181][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:13:57,507][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:13:57,833][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:13:58,162][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:13:58,488][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:13:58,815][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:13:59,140][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:13:59,469][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:13:59,797][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:14:00,128][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:14:00,460][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:14:00,786][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:14:01,114][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:14:01,440][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:14:01,768][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:14:02,094][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:14:02,420][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:14:02,745][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:14:03,071][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:14:03,398][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:14:03,727][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:14:04,052][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:14:04,380][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:14:04,705][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:14:05,032][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:14:05,359][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:14:05,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:14:06,391][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:14:07,125][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:14:07,126][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:14:07,128][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:14:08,948][__main__][INFO] - Iteration 351 took 23s (37.26% Gen, 55.14% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 48m 20s. Estimated total time: 19h 58m 40s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 57s, 500 more iterations: 3h 19m 46s.
[2025-11-13 10:14:08,950][__main__][INFO] - Starting iteration 351.
[2025-11-13 10:14:08,953][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:14:08,953][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:14:18,848][__main__][INFO] - Number of regex retries in iteration 351: 0
[2025-11-13 10:14:18,849][__main__][INFO] - agents played in iteration 351 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:14:19,316][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:19,349][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:19,382][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:19,416][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:19,416][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:14:19,416][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:14:20,147][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:14:20,444][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:14:20,772][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:14:21,104][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:14:21,431][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:14:21,759][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:14:22,085][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:14:22,412][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:14:22,739][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:14:23,066][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:14:23,392][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:14:23,724][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:14:24,053][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:14:24,384][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:14:24,715][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:14:25,043][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:14:25,371][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:14:25,700][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:14:26,028][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:14:26,355][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:14:26,686][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:14:27,015][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:14:27,343][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:14:27,675][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:14:28,002][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:14:28,327][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:14:28,655][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:14:28,981][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:14:29,306][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:14:29,634][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:14:29,961][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:14:30,289][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:14:30,616][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:14:31,320][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:14:32,031][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:14:32,033][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:14:32,035][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:14:32,944][__main__][INFO] - Iteration 352 took 23s (41.24% Gen, 54.96% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 48m 51s. Estimated total time: 19h 59m 35s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 59s, 500 more iterations: 3h 19m 55s.
[2025-11-13 10:14:32,946][__main__][INFO] - Starting iteration 352.
[2025-11-13 10:14:32,948][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:14:32,949][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:14:41,453][__main__][INFO] - Number of regex retries in iteration 352: 0
[2025-11-13 10:14:41,453][__main__][INFO] - agents played in iteration 352 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:14:41,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:41,943][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:41,976][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:42,010][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:42,011][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:14:42,011][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:14:43,082][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:14:43,380][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:14:43,707][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:14:44,034][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:14:44,359][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:14:44,690][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:14:45,016][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:14:45,342][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:14:45,669][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:14:45,996][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:14:46,324][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:14:46,651][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:14:46,979][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:14:47,306][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:14:47,635][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:14:47,963][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:14:48,290][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:14:48,619][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:14:48,946][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:14:49,273][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:14:49,600][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:14:49,926][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:14:50,252][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:14:50,578][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:14:50,909][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:14:51,235][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:14:51,561][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:14:51,888][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:14:52,217][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:14:52,543][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:14:52,871][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:14:53,199][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:14:53,529][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:14:54,229][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:14:54,956][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:14:54,958][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:14:54,960][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:14:55,859][__main__][INFO] - Iteration 353 took 22s (37.12% Gen, 58.95% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 54m 28s. Estimated total time: 19h 5m 35s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 11s, 500 more iterations: 3h 10m 55s.
[2025-11-13 10:14:55,862][__main__][INFO] - Starting iteration 353.
[2025-11-13 10:14:55,865][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:14:55,865][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:15:05,448][__main__][INFO] - Number of regex retries in iteration 353: 0
[2025-11-13 10:15:05,448][__main__][INFO] - agents played in iteration 353 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:15:05,901][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:05,934][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:05,968][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:06,001][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:06,002][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:15:06,002][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:15:06,724][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:15:07,021][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:15:07,349][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:15:07,675][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:15:08,007][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:15:08,334][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:15:08,659][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:15:08,986][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:15:09,312][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:15:09,638][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:15:09,965][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:15:10,292][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:15:10,618][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:15:10,944][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:15:11,278][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:15:11,607][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:15:11,936][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:15:12,266][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:15:12,600][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:15:12,929][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:15:13,256][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:15:13,580][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:15:13,908][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:15:14,233][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:15:14,559][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:15:14,888][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:15:15,219][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:15:15,546][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:15:15,872][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:15:16,198][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:15:16,524][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:15:16,851][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:15:17,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:15:17,889][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:15:18,607][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:15:18,609][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:15:18,610][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:15:19,512][__main__][INFO] - Iteration 354 took 23s (40.52% Gen, 55.66% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 30m 55s. Estimated total time: 19h 42m 25s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 24s, 500 more iterations: 3h 17m 4s.
[2025-11-13 10:15:19,515][__main__][INFO] - Starting iteration 354.
[2025-11-13 10:15:19,518][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:15:19,518][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:15:29,112][__main__][INFO] - Number of regex retries in iteration 354: 0
[2025-11-13 10:15:29,113][__main__][INFO] - agents played in iteration 354 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:15:29,565][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:29,599][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:29,632][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:29,665][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:29,666][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:15:29,667][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:15:30,411][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:15:30,706][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:15:31,033][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:15:31,360][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:15:31,687][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:15:32,012][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:15:32,338][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:15:32,668][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:15:32,995][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:15:33,325][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:15:33,652][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:15:33,980][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:15:34,307][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:15:34,633][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:15:34,960][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:15:35,290][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:15:35,618][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:15:35,946][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:15:36,274][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:15:36,603][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:15:36,934][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:15:37,263][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:15:37,593][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:15:37,920][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:15:38,247][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:15:38,575][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:15:38,902][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:15:39,228][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:15:39,560][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:15:39,890][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:15:40,218][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:15:40,551][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:15:40,878][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:15:41,600][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:15:42,311][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:15:42,312][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:15:42,314][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:15:43,187][__main__][INFO] - Iteration 355 took 23s (40.53% Gen, 55.77% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 31m 38s. Estimated total time: 19h 43m 32s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 27s, 500 more iterations: 3h 17m 15s.
[2025-11-13 10:15:43,190][__main__][INFO] - Starting iteration 355.
[2025-11-13 10:15:43,192][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:15:43,193][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:15:52,506][__main__][INFO] - Number of regex retries in iteration 355: 0
[2025-11-13 10:15:52,506][__main__][INFO] - agents played in iteration 355 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:15:52,960][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:52,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:53,026][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:53,060][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:53,060][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:15:53,060][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:15:53,807][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:15:54,103][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:15:54,431][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:15:54,757][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:15:55,089][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:15:55,415][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:15:55,743][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:15:56,070][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:15:56,397][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:15:56,724][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:15:57,053][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:15:57,384][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:15:57,710][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:15:58,041][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:15:58,370][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:15:58,702][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:15:59,035][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:15:59,365][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:15:59,692][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:16:00,020][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:16:00,348][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:16:00,680][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:16:01,007][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:16:01,333][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:16:01,660][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:16:01,987][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:16:02,314][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:16:02,640][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:16:02,966][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:16:03,295][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:16:03,621][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:16:03,947][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:16:04,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:16:04,968][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:16:05,675][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:16:05,677][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:16:05,678][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:16:06,569][__main__][INFO] - Iteration 356 took 23s (39.84% Gen, 56.34% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 16m 37s. Estimated total time: 19h 28m 54s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 57s, 500 more iterations: 3h 14m 49s.
[2025-11-13 10:16:06,571][__main__][INFO] - Starting iteration 356.
[2025-11-13 10:16:06,574][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:16:06,575][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:16:15,915][__main__][INFO] - Number of regex retries in iteration 356: 0
[2025-11-13 10:16:15,915][__main__][INFO] - agents played in iteration 356 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:16:16,372][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:16,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:16,438][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:16,472][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:16,473][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:16:16,473][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:16:17,202][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:16:17,500][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:16:17,826][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:16:18,153][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:16:18,478][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:16:18,803][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:16:19,131][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:16:19,459][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:16:19,790][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:16:20,120][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:16:20,445][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:16:20,771][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:16:21,099][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:16:21,426][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:16:21,756][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:16:22,081][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:16:22,410][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:16:22,738][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:16:23,072][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:16:23,401][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:16:23,729][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:16:24,062][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:16:24,392][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:16:24,720][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:16:25,052][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:16:25,381][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:16:25,706][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:16:26,035][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:16:26,362][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:16:26,691][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:16:27,022][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:16:27,350][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:16:27,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:16:28,378][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:16:29,081][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:16:29,083][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:16:29,085][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:16:29,961][__main__][INFO] - Iteration 357 took 23s (39.94% Gen, 56.31% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 16m 43s. Estimated total time: 19h 29m 23s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 58s, 500 more iterations: 3h 14m 53s.
[2025-11-13 10:16:29,964][__main__][INFO] - Starting iteration 357.
[2025-11-13 10:16:29,967][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:16:29,968][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:16:39,143][__main__][INFO] - Number of regex retries in iteration 357: 0
[2025-11-13 10:16:39,143][__main__][INFO] - agents played in iteration 357 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:16:39,596][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:39,630][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:39,663][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:39,697][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:39,698][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:16:39,698][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:16:40,426][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:16:40,722][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:16:41,048][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:16:41,375][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:16:41,703][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:16:42,028][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:16:42,362][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:16:42,690][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:16:43,015][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:16:43,342][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:16:43,669][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:16:43,995][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:16:44,325][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:16:44,651][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:16:44,979][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:16:45,303][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:16:45,631][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:16:45,958][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:16:46,286][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:16:46,612][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:16:46,940][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:16:47,269][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:16:47,594][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:16:47,919][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:16:48,247][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:16:48,572][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:16:48,900][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:16:49,225][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:16:49,551][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:16:49,877][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:16:50,205][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:16:50,533][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:16:50,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:16:51,567][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:16:52,270][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:16:52,272][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:16:52,273][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:16:53,157][__main__][INFO] - Iteration 358 took 23s (39.56% Gen, 56.62% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 6m 27s. Estimated total time: 19h 19m 31s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 39s, 500 more iterations: 3h 13m 15s.
[2025-11-13 10:16:53,159][__main__][INFO] - Starting iteration 358.
[2025-11-13 10:16:53,162][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:16:53,162][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:17:02,591][__main__][INFO] - Number of regex retries in iteration 358: 0
[2025-11-13 10:17:02,592][__main__][INFO] - agents played in iteration 358 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:17:03,059][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:03,092][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:03,125][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:03,159][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:03,160][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:17:03,160][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:17:03,894][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:17:04,192][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:17:04,520][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:17:04,848][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:17:05,176][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:17:05,501][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:17:05,830][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:17:06,157][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:17:06,483][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:17:06,811][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:17:07,139][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:17:07,467][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:17:07,794][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:17:08,119][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:17:08,446][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:17:08,773][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:17:09,098][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:17:09,427][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:17:09,757][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:17:10,085][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:17:10,414][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:17:10,742][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:17:11,069][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:17:11,395][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:17:11,721][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:17:12,048][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:17:12,376][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:17:12,704][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:17:13,034][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:17:13,363][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:17:13,692][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:17:14,018][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:17:14,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:17:15,069][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:17:15,776][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:17:15,778][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:17:15,780][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:17:16,673][__main__][INFO] - Iteration 359 took 23s (40.10% Gen, 56.09% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 22m 7s. Estimated total time: 19h 35m 35s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 11s, 500 more iterations: 3h 15m 55s.
[2025-11-13 10:17:16,676][__main__][INFO] - Starting iteration 359.
[2025-11-13 10:17:16,679][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:17:16,680][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:17:25,506][__main__][INFO] - Number of regex retries in iteration 359: 0
[2025-11-13 10:17:25,507][__main__][INFO] - agents played in iteration 359 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:17:25,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:25,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:26,026][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:26,060][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:26,061][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:17:26,061][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:17:26,785][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:17:27,083][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:17:27,410][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:17:27,737][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:17:28,064][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:17:28,391][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:17:28,718][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:17:29,044][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:17:29,371][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:17:29,696][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:17:30,028][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:17:30,355][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:17:30,681][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:17:31,006][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:17:31,332][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:17:31,660][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:17:31,989][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:17:32,321][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:17:32,648][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:17:32,976][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:17:33,303][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:17:33,632][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:17:33,959][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:17:34,286][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:17:34,611][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:17:34,939][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:17:35,265][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:17:35,592][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:17:35,917][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:17:36,244][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:17:36,575][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:17:36,901][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:17:37,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:17:37,930][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:17:38,648][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:17:38,649][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:17:38,651][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:17:39,595][__main__][INFO] - Iteration 360 took 22s (38.52% Gen, 57.36% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 51m 58s. Estimated total time: 19h 5m 49s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 11s, 500 more iterations: 3h 10m 58s.
[2025-11-13 10:17:39,597][__main__][INFO] - Starting iteration 360.
[2025-11-13 10:17:39,600][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:17:39,600][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:17:48,833][__main__][INFO] - Number of regex retries in iteration 360: 0
[2025-11-13 10:17:48,833][__main__][INFO] - agents played in iteration 360 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:17:49,300][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:49,333][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:49,367][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:49,401][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:49,402][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:17:49,402][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:17:50,143][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:17:50,441][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:17:50,770][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:17:51,096][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:17:51,422][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:17:51,749][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:17:52,076][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:17:52,404][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:17:52,732][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:17:53,058][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:17:53,387][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:17:53,717][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:17:54,045][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:17:54,377][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:17:54,704][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:17:55,037][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:17:55,367][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:17:55,696][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:17:56,027][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:17:56,357][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:17:56,686][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:17:57,014][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:17:57,343][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:17:57,676][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:17:58,007][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:17:58,333][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:17:58,660][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:17:58,986][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:17:59,311][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:17:59,640][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:17:59,968][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:18:00,295][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:18:00,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:18:01,326][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:18:02,053][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:18:02,054][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:18:02,056][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:18:03,972][__main__][INFO] - Iteration 361 took 24s (37.88% Gen, 54.25% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 4m 24s. Estimated total time: 20h 18m 39s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 37s, 500 more iterations: 3h 23m 6s.
[2025-11-13 10:18:03,974][__main__][INFO] - Starting iteration 361.
[2025-11-13 10:18:03,977][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:18:03,978][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:18:13,255][__main__][INFO] - Number of regex retries in iteration 361: 0
[2025-11-13 10:18:13,256][__main__][INFO] - agents played in iteration 361 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:18:13,712][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:18:13,745][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:18:13,779][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:18:13,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:18:13,813][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:18:13,814][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:18:14,541][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:18:14,839][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:18:15,168][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:18:15,494][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:18:15,820][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:18:16,146][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:18:16,473][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:18:16,801][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:18:17,130][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:18:17,458][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:18:17,787][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:18:18,115][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:18:18,445][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:18:18,776][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:18:19,109][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:18:19,437][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:18:19,764][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:18:20,091][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:18:20,418][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:18:20,747][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:18:21,075][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:18:21,406][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:18:21,738][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:18:22,068][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:18:22,396][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:18:22,723][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:18:23,051][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:18:23,382][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:18:23,710][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:18:24,038][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:18:24,365][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:18:24,696][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:18:25,030][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:18:25,757][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:18:26,486][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:18:26,488][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:18:26,489][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:18:27,532][__main__][INFO] - Iteration 362 took 23s (39.39% Gen, 56.18% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 23m 11s. Estimated total time: 19h 37m 49s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 15s, 500 more iterations: 3h 16m 18s.
[2025-11-13 10:18:27,535][__main__][INFO] - Starting iteration 362.
[2025-11-13 10:18:27,537][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:18:27,538][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:18:36,991][__main__][INFO] - Number of regex retries in iteration 362: 0
[2025-11-13 10:18:36,992][__main__][INFO] - agents played in iteration 362 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:18:37,457][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:18:37,490][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:18:37,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:18:37,557][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:18:37,558][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:18:37,558][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:18:38,288][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:18:38,584][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:18:38,912][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:18:39,241][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:18:39,569][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:18:39,896][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:18:40,225][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:18:40,556][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:18:40,883][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:18:41,209][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:18:41,535][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:18:41,863][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:18:42,191][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:18:42,524][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:18:42,852][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:18:43,178][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:18:43,504][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:18:43,832][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:18:44,162][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:18:44,489][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:18:44,816][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:18:45,143][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:18:45,471][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:18:45,798][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:18:46,127][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:18:46,453][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:18:46,783][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:18:47,109][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:18:47,437][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:18:47,764][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:18:48,091][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:18:48,421][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:18:48,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:18:49,451][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:18:50,164][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:18:50,165][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:18:50,167][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:18:51,055][__main__][INFO] - Iteration 363 took 23s (40.20% Gen, 56.02% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 20m 53s. Estimated total time: 19h 35m 55s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 11s, 500 more iterations: 3h 15m 59s.
[2025-11-13 10:18:51,057][__main__][INFO] - Starting iteration 363.
[2025-11-13 10:18:51,060][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:18:51,061][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:19:00,455][__main__][INFO] - Number of regex retries in iteration 363: 0
[2025-11-13 10:19:00,456][__main__][INFO] - agents played in iteration 363 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:19:00,911][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:19:00,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:19:00,977][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:19:01,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:19:01,011][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:19:01,012][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:19:01,739][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:19:02,036][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:19:02,363][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:19:02,689][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:19:03,016][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:19:03,342][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:19:03,670][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:19:03,996][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:19:04,325][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:19:04,651][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:19:04,976][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:19:05,305][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:19:05,631][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:19:05,960][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:19:06,286][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:19:06,614][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:19:06,941][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:19:07,270][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:19:07,601][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:19:07,933][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:19:08,259][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:19:08,586][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:19:08,912][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:19:09,238][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:19:09,566][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:19:09,897][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:19:10,224][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:19:10,550][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:19:10,877][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:19:11,204][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:19:11,532][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:19:11,860][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:19:12,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:19:12,892][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:19:13,609][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:19:13,610][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:19:13,612][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:19:14,533][__main__][INFO] - Iteration 364 took 23s (40.02% Gen, 56.05% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 18m 17s. Estimated total time: 19h 33m 42s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 7s, 500 more iterations: 3h 15m 37s.
[2025-11-13 10:19:14,536][__main__][INFO] - Starting iteration 364.
[2025-11-13 10:19:14,539][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:19:14,539][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:19:23,815][__main__][INFO] - Number of regex retries in iteration 364: 0
[2025-11-13 10:19:23,816][__main__][INFO] - agents played in iteration 364 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:19:24,299][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:19:24,332][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:19:24,366][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:19:24,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:19:24,400][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:19:24,401][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:19:25,466][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:19:25,765][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:19:26,092][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:19:26,418][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:19:26,744][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:19:27,070][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:19:27,398][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:19:27,724][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:19:28,050][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:19:28,379][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:19:28,710][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:19:29,037][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:19:29,363][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:19:29,689][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:19:30,016][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:19:30,343][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:19:30,671][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:19:30,997][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:19:31,330][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:19:31,657][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:19:31,983][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:19:32,310][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:19:32,637][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:19:32,965][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:19:33,291][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:19:33,618][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:19:33,946][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:19:34,273][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:19:34,601][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:19:34,928][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:19:35,255][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:19:35,584][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:19:35,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:19:36,620][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:19:37,323][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:19:37,325][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:19:37,326][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:19:38,204][__main__][INFO] - Iteration 365 took 23s (39.20% Gen, 57.09% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 27m 29s. Estimated total time: 19h 43m 18s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 26s, 500 more iterations: 3h 17m 13s.
[2025-11-13 10:19:38,206][__main__][INFO] - Starting iteration 365.
[2025-11-13 10:19:38,209][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:19:38,210][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:19:47,849][__main__][INFO] - Number of regex retries in iteration 365: 0
[2025-11-13 10:19:47,849][__main__][INFO] - agents played in iteration 365 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:19:48,305][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:19:48,338][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:19:48,372][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:19:48,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:19:48,406][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:19:48,406][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:19:49,135][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:19:49,433][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:19:49,761][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:19:50,088][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:19:50,415][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:19:50,740][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:19:51,068][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:19:51,396][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:19:51,724][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:19:52,050][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:19:52,379][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:19:52,708][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:19:53,034][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:19:53,360][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:19:53,687][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:19:54,014][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:19:54,345][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:19:54,677][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:19:55,004][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:19:55,333][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:19:55,661][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:19:55,988][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:19:56,314][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:19:56,639][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:19:56,966][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:19:57,293][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:19:57,619][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:19:57,948][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:19:58,274][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:19:58,600][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:19:58,927][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:19:59,256][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:19:59,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:20:00,305][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:20:00,996][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:20:00,997][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:20:00,999][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:20:01,913][__main__][INFO] - Iteration 366 took 23s (40.67% Gen, 55.47% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 29m 0s. Estimated total time: 19h 45m 12s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 30s, 500 more iterations: 3h 17m 32s.
[2025-11-13 10:20:01,915][__main__][INFO] - Starting iteration 366.
[2025-11-13 10:20:01,917][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:20:01,918][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:20:10,861][__main__][INFO] - Number of regex retries in iteration 366: 0
[2025-11-13 10:20:10,861][__main__][INFO] - agents played in iteration 366 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:20:11,354][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:11,386][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:11,420][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:11,453][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:11,454][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:20:11,454][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:20:12,184][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:20:12,483][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:20:12,811][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:20:13,138][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:20:13,464][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:20:13,791][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:20:14,117][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:20:14,443][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:20:14,771][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:20:15,096][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:20:15,422][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:20:15,750][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:20:16,076][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:20:16,402][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:20:16,728][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:20:17,054][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:20:17,384][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:20:17,712][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:20:18,040][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:20:18,367][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:20:18,692][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:20:19,021][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:20:19,350][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:20:19,678][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:20:20,007][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:20:20,337][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:20:20,664][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:20:20,991][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:20:21,319][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:20:21,649][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:20:21,975][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:20:22,306][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:20:22,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:20:23,336][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:20:24,033][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:20:24,035][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:20:24,038][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:20:24,929][__main__][INFO] - Iteration 367 took 23s (38.86% Gen, 57.26% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 54m 1s. Estimated total time: 19h 10m 36s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 21s, 500 more iterations: 3h 11m 46s.
[2025-11-13 10:20:24,931][__main__][INFO] - Starting iteration 367.
[2025-11-13 10:20:24,934][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:20:24,934][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:20:33,778][__main__][INFO] - Number of regex retries in iteration 367: 0
[2025-11-13 10:20:33,779][__main__][INFO] - agents played in iteration 367 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:20:34,231][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:34,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:34,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:34,331][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:34,332][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:20:34,332][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:20:35,079][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:20:35,377][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:20:35,704][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:20:36,030][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:20:36,357][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:20:36,683][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:20:37,009][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:20:37,337][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:20:37,663][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:20:37,989][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:20:38,317][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:20:38,648][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:20:38,978][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:20:39,306][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:20:39,634][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:20:39,966][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:20:40,293][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:20:40,620][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:20:40,951][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:20:41,280][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:20:41,609][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:20:41,938][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:20:42,272][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:20:42,601][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:20:42,930][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:20:43,258][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:20:43,590][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:20:43,919][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:20:44,251][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:20:44,579][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:20:44,905][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:20:45,234][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:20:45,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:20:46,284][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:20:46,984][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:20:46,986][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:20:46,987][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:20:47,862][__main__][INFO] - Iteration 368 took 22s (38.57% Gen, 57.61% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 49m 27s. Estimated total time: 19h 6m 26s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 12s, 500 more iterations: 3h 11m 4s.
[2025-11-13 10:20:47,864][__main__][INFO] - Starting iteration 368.
[2025-11-13 10:20:47,866][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:20:47,867][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:20:57,350][__main__][INFO] - Number of regex retries in iteration 368: 0 [2025-11-13 10:20:57,350][__main__][INFO] - agents played in iteration 368 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:20:57,809][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:20:57,842][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:20:57,876][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:20:57,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:20:57,910][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:20:57,910][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:20:58,648][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:20:58,947][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:20:59,275][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:20:59,602][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:20:59,928][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:21:00,254][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:21:00,580][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:21:00,906][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:21:01,234][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:21:01,561][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:21:01,889][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:21:02,216][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:21:02,542][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:21:02,868][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:21:03,194][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:21:03,520][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:21:03,847][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:21:04,174][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:21:04,501][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:21:04,826][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:21:05,152][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:21:05,480][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:21:05,810][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:21:06,142][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:21:06,471][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:21:06,798][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:21:07,126][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:21:07,453][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:21:07,780][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:21:08,107][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:21:08,436][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:21:08,768][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:21:09,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:21:09,815][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:21:10,525][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:21:10,526][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:21:10,528][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:21:11,431][__main__][INFO] - Iteration 369 took 23s (40.24% Gen, 55.92% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 20m 54s. Estimated total time: 19h 38m 16s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 16s, 500 more iterations: 3h 16m 22s.
[2025-11-13 10:21:11,433][__main__][INFO] - Starting iteration 369.
[2025-11-13 10:21:11,436][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:21:11,436][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:21:20,769][__main__][INFO] - Number of regex retries in iteration 369: 0
[2025-11-13 10:21:20,769][__main__][INFO] - agents played in iteration 369 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:21:21,231][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:21,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:21,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:21,331][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:21,332][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:21:21,332][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:21:22,063][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:21:22,362][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:21:22,690][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:21:23,016][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:21:23,342][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:21:23,669][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:21:23,995][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:21:24,321][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:21:24,649][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:21:24,975][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:21:25,302][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:21:25,629][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:21:25,957][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:21:26,283][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:21:26,609][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:21:26,935][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:21:27,262][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:21:27,590][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:21:27,915][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:21:28,242][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:21:28,569][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:21:28,896][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:21:29,225][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:21:29,552][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:21:29,879][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:21:30,207][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:21:30,534][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:21:30,860][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:21:31,189][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:21:31,519][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:21:31,848][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:21:32,174][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:21:32,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:21:33,220][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:21:33,940][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:21:33,942][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:21:33,944][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:21:34,838][__main__][INFO] - Iteration 370 took 23s (39.88% Gen, 56.30% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 12m 25s. Estimated total time: 19h 30m 11s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 0s, 500 more iterations: 3h 15m 1s.
[2025-11-13 10:21:34,841][__main__][INFO] - Starting iteration 370.
[2025-11-13 10:21:34,844][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:21:34,844][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:21:43,850][__main__][INFO] - Number of regex retries in iteration 370: 0
[2025-11-13 10:21:43,851][__main__][INFO] - agents played in iteration 370 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:21:44,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:44,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:44,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:44,411][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:44,411][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:21:44,411][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:21:45,165][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:21:45,462][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:21:45,795][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:21:46,121][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:21:46,448][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:21:46,775][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:21:47,101][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:21:47,428][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:21:47,755][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:21:48,081][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:21:48,409][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:21:48,737][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:21:49,066][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:21:49,392][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:21:49,718][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:21:50,044][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:21:50,372][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:21:50,698][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:21:51,025][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:21:51,350][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:21:51,677][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:21:52,008][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:21:52,332][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:21:52,660][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:21:52,989][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:21:53,320][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:21:53,653][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:21:53,987][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:21:54,318][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:21:54,643][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:21:54,971][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:21:55,297][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:21:55,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:21:56,324][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:21:57,056][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:21:57,057][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:21:57,059][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:21:58,792][__main__][INFO] - Iteration 371 took 23s (37.61% Gen, 55.15% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 39m 19s. Estimated total time: 19h 57m 28s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 54s, 500 more iterations: 3h 19m 34s.
[2025-11-13 10:21:58,794][__main__][INFO] - Starting iteration 371.
[2025-11-13 10:21:58,797][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:21:58,798][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:22:08,382][__main__][INFO] - Number of regex retries in iteration 371: 0
[2025-11-13 10:22:08,383][__main__][INFO] - agents played in iteration 371 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:22:08,838][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:08,871][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:08,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:08,938][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:08,939][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:22:08,939][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:22:09,670][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:22:09,967][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:22:10,293][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:22:10,621][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:22:10,947][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:22:11,274][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:22:11,600][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:22:11,926][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:22:12,252][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:22:12,580][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:22:12,905][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:22:13,232][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:22:13,559][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:22:13,890][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:22:14,218][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:22:14,544][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:22:14,870][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:22:15,199][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:22:15,527][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:22:15,856][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:22:16,185][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:22:16,511][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:22:16,839][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:22:17,167][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:22:17,494][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:22:17,822][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:22:18,148][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:22:18,475][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:22:18,803][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:22:19,130][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:22:19,458][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:22:19,784][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:22:20,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:22:20,807][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:22:21,499][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:22:21,501][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:22:21,502][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:22:22,385][__main__][INFO] - Iteration 372 took 23s (40.63% Gen, 55.62% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 20m 51s. Estimated total time: 19h 39m 24s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 18s, 500 more iterations: 3h 16m 34s.
[2025-11-13 10:22:22,387][__main__][INFO] - Starting iteration 372.
[2025-11-13 10:22:22,390][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:22:22,390][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:22:31,815][__main__][INFO] - Number of regex retries in iteration 372: 0
[2025-11-13 10:22:31,815][__main__][INFO] - agents played in iteration 372 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:22:32,270][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:32,303][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:32,338][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:32,372][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:32,372][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:22:32,373][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:22:33,113][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:22:33,410][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:22:33,738][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:22:34,064][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:22:34,390][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:22:34,715][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:22:35,041][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:22:35,370][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:22:35,696][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:22:36,023][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:22:36,351][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:22:36,677][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:22:37,005][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:22:37,332][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:22:37,659][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:22:37,987][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:22:38,315][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:22:38,643][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:22:38,970][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:22:39,296][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:22:39,624][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:22:39,950][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:22:40,277][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:22:40,603][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:22:40,934][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:22:41,260][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:22:41,589][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:22:41,917][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:22:42,244][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:22:42,572][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:22:42,899][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:22:43,228][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:22:43,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:22:44,274][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:22:44,970][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:22:44,972][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:22:44,973][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:22:45,865][__main__][INFO] - Iteration 373 took 23s (40.15% Gen, 56.05% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 14m 51s. Estimated total time: 19h 33m 48s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 7s, 500 more iterations: 3h 15m 38s.
[2025-11-13 10:22:45,867][__main__][INFO] - Starting iteration 373.
[2025-11-13 10:22:45,870][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:22:45,870][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:22:54,784][__main__][INFO] - Number of regex retries in iteration 373: 0
[2025-11-13 10:22:54,785][__main__][INFO] - agents played in iteration 373 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:22:55,245][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:55,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:55,312][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:55,346][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:55,347][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:22:55,348][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:22:56,084][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:22:56,381][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:22:56,709][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:22:57,036][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:22:57,364][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:22:57,691][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:22:58,017][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:22:58,344][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:22:58,670][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:22:58,998][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:22:59,325][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:22:59,653][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:22:59,979][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:23:00,307][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:23:00,633][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:23:00,965][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:23:01,290][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:23:01,619][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:23:01,950][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:23:02,278][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:23:02,603][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:23:02,928][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:23:03,255][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:23:03,587][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:23:03,914][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:23:04,243][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:23:04,574][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:23:04,908][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:23:05,237][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:23:05,570][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:23:05,900][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:23:06,229][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:23:06,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:23:07,295][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:23:07,975][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:23:07,977][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:23:07,980][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:23:08,903][__main__][INFO] - Iteration 374 took 23s (38.70% Gen, 57.29% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 52m 23s. Estimated total time: 19h 11m 42s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 23s, 500 more iterations: 3h 11m 57s.
[2025-11-13 10:23:08,905][__main__][INFO] - Starting iteration 374.
[2025-11-13 10:23:08,907][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:23:08,908][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:23:18,257][__main__][INFO] - Number of regex retries in iteration 374: 0
[2025-11-13 10:23:18,258][__main__][INFO] - agents played in iteration 374 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:23:18,717][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:18,751][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:18,785][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:18,819][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:18,820][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:23:18,820][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:23:19,550][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:23:19,847][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:23:20,177][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:23:20,503][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:23:20,829][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:23:21,154][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:23:21,483][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:23:21,809][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:23:22,135][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:23:22,462][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:23:22,790][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:23:23,115][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:23:23,441][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:23:23,768][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:23:24,098][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:23:24,427][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:23:24,758][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:23:25,090][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:23:25,415][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:23:25,744][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:23:26,075][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:23:26,403][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:23:26,730][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:23:27,056][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:23:27,383][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:23:27,712][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:23:28,040][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:23:28,370][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:23:28,699][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:23:29,030][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:23:29,358][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:23:29,689][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:23:30,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:23:30,741][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:23:31,420][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:23:31,421][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:23:31,423][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:23:32,305][__main__][INFO] - Iteration 375 took 23s (39.96% Gen, 56.27% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 10m 13s. Estimated total time: 19h 29m 57s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 59s, 500 more iterations: 3h 14m 59s.
[2025-11-13 10:23:32,308][__main__][INFO] - Starting iteration 375.
[2025-11-13 10:23:32,312][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:23:32,313][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:23:41,851][__main__][INFO] - Number of regex retries in iteration 375: 0
[2025-11-13 10:23:41,851][__main__][INFO] - agents played in iteration 375 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:23:42,308][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:42,341][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:42,375][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:42,408][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:42,409][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:23:42,409][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:23:43,138][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:23:43,434][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:23:43,761][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:23:44,088][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:23:44,413][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:23:44,740][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:23:45,068][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:23:45,394][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:23:45,721][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:23:46,049][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:23:46,377][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:23:46,703][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:23:47,030][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:23:47,356][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:23:47,683][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:23:48,009][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:23:48,336][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:23:48,664][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:23:48,991][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:23:49,316][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:23:49,644][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:23:49,970][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:23:50,296][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:23:50,622][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:23:50,949][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:23:51,275][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:23:51,601][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:23:51,929][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:23:52,256][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:23:52,582][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:23:52,907][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:23:53,233][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:23:53,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:23:54,265][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:23:54,951][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:23:54,953][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:23:54,954][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:23:55,840][__main__][INFO] - Iteration 376 took 23s (40.54% Gen, 55.69% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 16m 20s. Estimated total time: 19h 36m 27s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 12s, 500 more iterations: 3h 16m 4s.
[2025-11-13 10:23:55,842][__main__][INFO] - Starting iteration 376.
[2025-11-13 10:23:55,844][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:23:55,845][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:24:05,026][__main__][INFO] - Number of regex retries in iteration 376: 0
[2025-11-13 10:24:05,027][__main__][INFO] - agents played in iteration 376 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:24:05,486][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:05,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:05,552][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:05,586][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:05,586][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:24:05,587][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:24:06,317][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:24:06,615][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:24:06,944][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:24:07,271][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:24:07,597][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:24:07,923][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:24:08,249][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:24:08,576][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:24:08,903][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:24:09,229][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:24:09,556][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:24:09,882][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:24:10,209][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:24:10,535][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:24:10,860][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:24:11,189][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:24:11,516][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:24:11,843][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:24:12,172][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:24:12,499][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:24:12,826][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:24:13,151][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:24:13,479][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:24:13,806][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:24:14,132][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:24:14,459][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:24:14,786][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:24:15,114][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:24:15,444][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:24:15,776][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:24:16,104][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:24:16,433][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:24:16,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:24:17,470][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:24:18,161][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:24:18,163][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:24:18,165][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:24:19,116][__main__][INFO] - Iteration 377 took 23s (39.45% Gen, 56.45% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 3m 7s. Estimated total time: 19h 23m 37s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 47s, 500 more iterations: 3h 13m 56s.
[2025-11-13 10:24:19,118][__main__][INFO] - Starting iteration 377.
[2025-11-13 10:24:19,121][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:24:19,121][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:24:28,613][__main__][INFO] - Number of regex retries in iteration 377: 0
[2025-11-13 10:24:28,614][__main__][INFO] - agents played in iteration 377 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:24:29,071][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:29,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:29,137][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:29,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:29,172][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:24:29,172][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:24:29,920][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:24:30,216][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:24:30,543][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:24:30,869][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:24:31,195][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:24:31,521][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:24:31,847][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:24:32,173][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:24:32,500][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:24:32,827][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:24:33,154][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:24:33,481][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:24:33,810][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:24:34,139][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:24:34,468][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:24:34,795][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:24:35,121][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:24:35,448][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:24:35,775][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:24:36,102][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:24:36,430][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:24:36,758][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:24:37,090][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:24:37,416][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:24:37,744][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:24:38,070][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:24:38,398][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:24:38,727][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:24:39,055][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:24:39,386][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:24:39,716][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:24:40,047][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:24:40,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:24:41,105][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:24:41,794][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:24:41,796][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:24:41,797][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:24:42,681][__main__][INFO] - Iteration 378 took 23s (40.29% Gen, 55.95% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 17m 10s. Estimated total time: 19h 38m 4s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 16s, 500 more iterations: 3h 16m 20s.
[2025-11-13 10:24:42,683][__main__][INFO] - Starting iteration 378.
[2025-11-13 10:24:42,686][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:24:42,687][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:24:51,375][__main__][INFO] - Number of regex retries in iteration 378: 0
[2025-11-13 10:24:51,376][__main__][INFO] - agents played in iteration 378 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:24:51,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:51,867][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:51,901][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:51,934][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:51,935][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:24:51,935][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:24:52,665][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:24:52,964][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:24:53,290][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:24:53,617][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:24:53,942][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:24:54,269][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:24:54,595][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:24:54,920][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:24:55,247][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:24:55,573][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:24:55,900][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:24:56,226][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:24:56,555][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:24:56,880][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:24:57,205][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:24:57,531][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:24:57,857][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:24:58,183][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:24:58,508][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:24:58,840][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:24:59,167][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:24:59,493][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:24:59,818][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:25:00,143][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:25:00,471][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:25:00,803][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:25:01,129][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:25:01,457][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:25:01,783][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:25:02,115][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:25:02,443][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:25:02,769][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:25:03,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:25:03,802][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:25:04,505][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:25:04,506][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:25:04,508][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:25:05,435][__main__][INFO] - Iteration 379 took 22s (38.19% Gen, 57.72% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 36m 13s. Estimated total time: 18h 57m 30s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 55s, 500 more iterations: 3h 9m 35s.
[2025-11-13 10:25:05,437][__main__][INFO] - Starting iteration 379.
[2025-11-13 10:25:05,440][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:25:05,440][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:25:14,720][__main__][INFO] - Number of regex retries in iteration 379: 0 [2025-11-13 10:25:14,720][__main__][INFO] - agents played in iteration 379 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:25:15,207][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:25:15,249][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:25:15,283][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:25:15,317][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:25:15,317][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:25:15,318][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:25:16,057][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:25:16,355][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:25:16,682][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:25:17,010][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:25:17,337][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:25:17,663][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:25:17,991][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:25:18,317][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:25:18,644][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:25:18,971][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:25:19,297][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:25:19,623][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:25:19,950][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:25:20,276][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:25:20,602][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:25:20,933][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:25:21,261][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:25:21,588][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:25:21,913][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:25:22,241][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:25:22,569][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:25:22,895][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:25:23,221][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:25:23,547][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:25:23,873][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:25:24,199][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:25:24,526][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:25:24,854][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:25:25,180][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:25:25,507][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:25:25,832][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:25:26,158][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:25:26,485][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:25:27,181][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:25:27,894][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:25:27,895][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:25:27,897][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:25:28,891][__main__][INFO] - Iteration 380 took 23s (39.57% Gen, 56.18% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 10m 56s. Estimated total time: 19h 32m 36s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 5s, 500 more iterations: 3h 15m 26s.
[2025-11-13 10:25:28,893][__main__][INFO] - Starting iteration 380.
[2025-11-13 10:25:28,896][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:25:28,897][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:25:37,868][__main__][INFO] - Number of regex retries in iteration 380: 0
[2025-11-13 10:25:37,868][__main__][INFO] - agents played in iteration 380 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:25:38,325][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:38,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:38,401][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:38,440][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:38,440][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:25:38,441][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:25:39,263][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:25:39,561][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:25:39,889][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:25:40,216][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:25:40,542][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:25:40,869][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:25:41,196][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:25:41,522][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:25:41,850][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:25:42,176][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:25:42,503][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:25:42,829][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:25:43,156][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:25:43,483][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:25:43,809][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:25:44,136][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:25:44,466][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:25:44,794][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:25:45,120][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:25:45,450][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:25:45,777][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:25:46,108][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:25:46,434][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:25:46,763][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:25:47,092][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:25:47,420][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:25:47,749][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:25:48,078][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:25:48,407][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:25:48,736][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:25:49,066][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:25:49,397][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:25:49,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:25:50,419][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:25:51,122][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:25:51,123][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:25:51,125][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:25:53,354][__main__][INFO] - Iteration 381 took 24s (36.68% Gen, 54.20% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 0m 52s. Estimated total time: 20h 22m 56s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 45s, 500 more iterations: 3h 23m 49s.
[2025-11-13 10:25:53,356][__main__][INFO] - Starting iteration 381.
[2025-11-13 10:25:53,358][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1.
[2025-11-13 10:25:53,359][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:26:03,026][__main__][INFO] - Number of regex retries in iteration 381: 0
[2025-11-13 10:26:03,027][__main__][INFO] - agents played in iteration 381 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:26:03,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:03,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:03,555][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:03,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:03,589][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:26:03,590][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:26:04,337][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:26:04,633][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:26:04,961][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:26:05,288][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:26:05,614][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:26:05,941][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:26:06,267][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:26:06,593][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:26:06,919][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:26:07,246][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:26:07,574][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:26:07,900][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:26:08,228][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:26:08,554][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:26:08,880][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:26:09,208][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:26:09,533][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:26:09,859][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:26:10,186][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:26:10,511][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:26:10,838][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:26:11,164][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:26:11,491][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:26:11,824][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:26:12,154][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:26:12,484][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:26:12,812][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:26:13,139][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:26:13,470][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:26:13,797][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:26:14,128][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:26:14,454][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:26:14,781][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:26:15,496][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:26:16,216][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:26:16,217][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:26:16,219][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:26:17,179][__main__][INFO] - Iteration 382 took 23s (40.58% Gen, 55.38% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 28m 36s. Estimated total time: 19h 51m 3s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 42s, 500 more iterations: 3h 18m 30s.
[2025-11-13 10:26:17,181][__main__][INFO] - Starting iteration 382.
[2025-11-13 10:26:17,183][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1.
[2025-11-13 10:26:17,184][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:26:26,613][__main__][INFO] - Number of regex retries in iteration 382: 0
[2025-11-13 10:26:26,613][__main__][INFO] - agents played in iteration 382 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:26:27,074][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:27,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:27,141][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:27,175][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:27,175][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:26:27,175][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:26:27,912][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:26:28,209][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:26:28,537][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:26:28,864][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:26:29,191][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:26:29,517][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:26:29,844][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:26:30,173][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:26:30,505][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:26:30,832][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:26:31,158][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:26:31,486][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:26:31,812][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:26:32,139][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:26:32,468][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:26:32,794][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:26:33,122][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:26:33,453][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:26:33,784][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:26:34,112][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:26:34,440][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:26:34,768][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:26:35,094][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:26:35,420][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:26:35,748][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:26:36,075][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:26:36,402][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:26:36,731][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:26:37,057][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:26:37,386][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:26:37,714][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:26:38,042][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:26:38,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:26:39,072][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:26:39,785][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:26:39,786][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:26:39,788][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:26:40,728][__main__][INFO] - Iteration 383 took 23s (40.05% Gen, 55.95% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 14m 24s. Estimated total time: 19h 37m 16s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 14s, 500 more iterations: 3h 16m 12s.
[2025-11-13 10:26:40,730][__main__][INFO] - Starting iteration 383.
[2025-11-13 10:26:40,733][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1.
[2025-11-13 10:26:40,733][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:26:50,069][__main__][INFO] - Number of regex retries in iteration 383: 0
[2025-11-13 10:26:50,069][__main__][INFO] - agents played in iteration 383 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:26:50,525][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:50,559][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:50,593][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:50,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:50,628][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:26:50,629][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:26:51,376][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:26:51,674][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:26:52,000][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:26:52,327][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:26:52,655][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:26:52,981][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:26:53,307][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:26:53,633][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:26:53,961][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:26:54,288][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:26:54,615][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:26:54,941][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:26:55,270][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:26:55,597][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:26:55,924][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:26:56,254][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:26:56,581][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:26:56,910][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:26:57,236][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:26:57,562][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:26:57,890][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:26:58,216][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:26:58,542][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:26:58,868][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:26:59,196][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:26:59,523][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:26:59,851][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:27:00,178][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:27:00,506][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:27:00,832][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:27:01,160][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:27:01,488][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:27:01,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:27:02,548][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:27:03,282][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:27:03,283][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:27:03,285][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:27:04,254][__main__][INFO] - Iteration 384 took 23s (39.69% Gen, 56.18% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 12m 50s. Estimated total time: 19h 36m 5s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 12s, 500 more iterations: 3h 16m 0s.
[2025-11-13 10:27:04,256][__main__][INFO] - Starting iteration 384.
[2025-11-13 10:27:04,259][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1.
[2025-11-13 10:27:04,260][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:27:13,152][__main__][INFO] - Number of regex retries in iteration 384: 0
[2025-11-13 10:27:13,153][__main__][INFO] - agents played in iteration 384 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:27:13,612][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:13,646][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:13,680][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:13,715][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:13,715][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:27:13,715][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:27:14,473][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:27:14,770][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:27:15,100][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:27:15,428][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:27:15,755][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:27:16,081][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:27:16,409][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:27:16,735][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:27:17,061][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:27:17,388][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:27:17,714][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:27:18,042][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:27:18,370][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:27:18,696][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:27:19,023][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:27:19,350][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:27:19,676][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:27:20,002][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:27:20,330][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:27:20,660][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:27:20,988][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:27:21,313][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:27:21,640][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:27:21,967][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:27:22,295][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:27:22,626][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:27:22,957][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:27:23,284][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:27:23,617][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:27:23,945][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:27:24,276][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:27:24,609][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:27:24,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:27:25,642][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:27:26,375][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:27:26,376][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:27:26,378][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:27:27,324][__main__][INFO] - Iteration 385 took 23s (38.55% Gen, 57.34% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 49m 39s. Estimated total time: 19h 13m 17s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 26s, 500 more iterations: 3h 12m 12s.
[2025-11-13 10:27:27,326][__main__][INFO] - Starting iteration 385.
[2025-11-13 10:27:27,329][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1.
[2025-11-13 10:27:27,330][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:27:35,886][__main__][INFO] - Number of regex retries in iteration 385: 0
[2025-11-13 10:27:35,887][__main__][INFO] - agents played in iteration 385 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:27:36,340][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:36,375][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:36,409][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:36,443][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:36,443][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:27:36,444][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:27:37,183][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:27:37,480][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:27:37,807][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:27:38,133][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:27:38,459][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:27:38,786][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:27:39,113][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:27:39,440][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:27:39,766][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:27:40,097][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:27:40,425][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:27:40,751][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:27:41,080][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:27:41,407][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:27:41,734][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:27:42,061][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:27:42,389][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:27:42,716][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:27:43,043][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:27:43,373][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:27:43,704][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:27:44,033][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:27:44,359][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:27:44,691][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:27:45,017][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:27:45,343][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:27:45,670][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:27:45,999][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:27:46,326][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:27:46,654][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:27:46,987][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:27:47,313][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:27:47,643][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:27:48,339][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:27:49,056][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:27:49,058][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:27:49,060][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:27:50,076][__main__][INFO] - Iteration 386 took 22s (37.62% Gen, 57.91% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 33m 20s. Estimated total time: 18h 57m 21s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 54s, 500 more iterations: 3h 9m 33s.
[2025-11-13 10:27:50,078][__main__][INFO] - Starting iteration 386.
[2025-11-13 10:27:50,081][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1.
[2025-11-13 10:27:50,081][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:27:59,225][__main__][INFO] - Number of regex retries in iteration 386: 0
[2025-11-13 10:27:59,225][__main__][INFO] - agents played in iteration 386 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:27:59,687][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:59,723][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:59,757][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:59,792][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:59,792][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:27:59,793][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:28:00,528][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:28:00,825][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:28:01,152][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:28:01,478][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:28:01,805][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:28:02,131][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:28:02,458][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:28:02,785][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:28:03,111][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:28:03,438][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:28:03,765][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:28:04,091][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:28:04,418][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:28:04,745][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:28:05,072][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:28:05,399][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:28:05,726][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:28:06,053][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:28:06,380][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:28:06,707][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:28:07,035][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:28:07,363][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:28:07,689][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:28:08,017][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:28:08,343][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:28:08,670][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:28:08,997][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:28:09,329][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:28:09,658][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:28:09,990][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:28:10,317][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:28:10,644][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:28:10,974][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:28:11,676][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:28:12,414][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:28:12,415][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:28:12,417][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:28:13,378][__main__][INFO] - Iteration 387 took 23s (39.25% Gen, 56.62% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 0m 29s. Estimated total time: 19h 24m 53s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 49s, 500 more iterations: 3h 14m 8s.
[2025-11-13 10:28:13,380][__main__][INFO] - Starting iteration 387.
[2025-11-13 10:28:13,383][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1.
[2025-11-13 10:28:13,384][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:28:23,156][__main__][INFO] - Number of regex retries in iteration 387: 0
[2025-11-13 10:28:23,156][__main__][INFO] - agents played in iteration 387 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:28:23,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:28:23,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:28:23,681][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:28:23,715][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:28:23,715][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:28:23,716][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:28:24,469][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:28:24,767][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:28:25,096][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:28:25,422][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:28:25,748][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:28:26,075][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:28:26,402][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:28:26,729][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:28:27,057][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:28:27,384][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:28:27,710][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:28:28,035][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:28:28,361][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:28:28,689][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:28:29,014][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:28:29,341][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:28:29,667][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:28:29,994][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:28:30,320][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:28:30,646][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:28:30,973][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:28:31,300][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:28:31,625][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:28:31,951][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:28:32,278][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:28:32,604][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:28:32,933][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:28:33,262][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:28:33,589][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:28:33,916][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:28:34,243][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:28:34,570][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:28:34,900][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:28:35,582][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:28:36,615][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:28:36,616][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:28:36,618][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:28:37,569][__main__][INFO] - Iteration 388 took 24s (40.41% Gen, 55.66% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 44m 32s. Estimated total time: 20h 9m 20s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 18s, 500 more iterations: 3h 21m 33s.
[2025-11-13 10:28:37,571][__main__][INFO] - Starting iteration 388.
[2025-11-13 10:28:37,575][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1.
[2025-11-13 10:28:37,575][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:28:46,554][__main__][INFO] - Number of regex retries in iteration 388: 0
[2025-11-13 10:28:46,555][__main__][INFO] - agents played in iteration 388 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:28:47,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:28:47,054][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:28:47,088][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:28:47,122][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:28:47,123][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:28:47,123][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:28:47,875][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:28:48,173][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:28:48,500][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:28:48,826][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:28:49,153][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:28:49,480][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:28:49,806][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:28:50,133][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:28:50,460][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:28:50,787][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:28:51,113][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:28:51,440][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:28:51,766][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:28:52,094][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:28:52,420][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:28:52,747][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:28:53,073][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:28:53,400][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:28:53,726][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:28:54,055][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:28:54,384][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:28:54,713][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:28:55,042][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:28:55,367][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:28:55,693][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:28:56,020][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:28:56,348][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:28:56,675][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:28:57,002][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:28:57,331][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:28:57,657][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:28:57,984][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:28:58,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:28:59,018][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:28:59,760][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:28:59,761][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:28:59,763][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:29:00,723][__main__][INFO] - Iteration 389 took 23s (38.79% Gen, 57.06% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 52m 17s. Estimated total time: 19h 17m 28s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 34s, 500 more iterations: 3h 12m 54s.
[2025-11-13 10:29:00,726][__main__][INFO] - Starting iteration 389.
[2025-11-13 10:29:00,729][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1.
[2025-11-13 10:29:00,730][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:29:10,302][__main__][INFO] - Number of regex retries in iteration 389: 0
[2025-11-13 10:29:10,303][__main__][INFO] - agents played in iteration 389 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:29:10,763][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:29:10,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:29:10,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:29:10,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:29:10,864][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:29:10,865][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:29:11,611][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:29:11,907][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:29:12,234][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:29:12,559][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:29:12,886][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:29:13,212][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:29:13,538][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:29:13,865][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:29:14,193][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:29:14,521][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:29:14,848][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:29:15,175][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:29:15,504][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:29:15,831][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:29:16,161][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:29:16,493][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:29:16,818][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:29:17,146][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:29:17,473][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:29:17,800][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:29:18,126][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:29:18,453][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:29:18,780][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:29:19,108][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:29:19,436][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:29:19,763][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:29:20,091][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:29:20,417][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:29:20,744][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:29:21,070][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:29:21,397][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:29:21,723][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:29:22,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:29:22,771][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:29:23,524][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:29:23,525][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:29:23,527][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:29:24,619][__main__][INFO] - Iteration 390 took 23s (40.07% Gen, 55.36% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 28m 55s. Estimated total time: 19h 54m 30s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 49s, 500 more iterations: 3h 19m 5s.
[2025-11-13 10:29:24,621][__main__][INFO] - Starting iteration 390.
[2025-11-13 10:29:24,624][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1.
[2025-11-13 10:29:24,625][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:29:34,207][__main__][INFO] - Number of regex retries in iteration 390: 0 [2025-11-13 10:29:34,207][__main__][INFO] - agents played in iteration 390 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:29:34,661][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:34,694][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:34,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:34,761][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:34,762][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:29:34,762][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:29:35,512][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:29:35,878][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:29:36,205][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:29:36,534][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:29:36,860][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:29:37,189][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:29:37,516][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:29:37,843][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:29:38,171][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:29:38,498][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:29:38,824][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:29:39,151][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:29:39,479][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:29:39,807][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:29:40,136][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:29:40,466][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:29:40,797][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:29:41,126][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:29:41,454][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:29:41,785][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:29:42,110][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:29:42,436][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:29:42,765][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:29:43,094][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:29:43,420][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:29:43,749][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:29:44,079][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:29:44,411][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:29:44,741][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:29:45,074][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:29:45,402][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:29:45,731][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:29:46,058][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:29:46,741][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:29:47,463][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:29:47,464][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:29:47,466][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:29:49,370][__main__][INFO] - Iteration 391 took 24s (38.72% Gen, 53.58% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 11m 18s. Estimated total time: 20h 37m 18s. Time estimates for 10 more iterations: 4m 7s, 100 more iterations: 41m 14s, 500 more iterations: 3h 26m 13s.
[2025-11-13 10:29:49,372][__main__][INFO] - Starting iteration 391.
[2025-11-13 10:29:49,375][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1.
[2025-11-13 10:29:49,375][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:29:58,953][__main__][INFO] - Number of regex retries in iteration 391: 0
[2025-11-13 10:29:58,953][__main__][INFO] - agents played in iteration 391 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:29:59,417][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:29:59,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:29:59,483][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:29:59,517][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:29:59,518][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:29:59,518][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:30:00,258][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:30:00,555][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:30:00,882][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:30:01,209][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:30:01,535][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:30:01,863][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:30:02,190][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:30:02,515][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:30:02,843][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:30:03,170][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:30:03,495][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:30:03,823][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:30:04,152][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:30:04,479][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:30:04,807][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:30:05,138][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:30:05,467][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:30:05,799][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:30:06,127][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:30:06,459][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:30:06,785][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:30:07,116][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:30:07,442][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:30:07,770][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:30:08,099][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:30:08,431][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:30:08,758][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:30:09,086][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:30:09,414][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:30:09,743][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:30:10,070][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:30:10,398][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:30:10,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:30:11,453][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:30:12,192][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:30:12,194][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:30:12,196][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:30:13,150][__main__][INFO] - Iteration 392 took 23s (40.28% Gen, 55.70% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 22m 23s. Estimated total time: 19h 48m 47s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 37s, 500 more iterations: 3h 18m 7s.
[2025-11-13 10:30:13,152][__main__][INFO] - Starting iteration 392.
[2025-11-13 10:30:13,155][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1.
[2025-11-13 10:30:13,156][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:30:22,732][__main__][INFO] - Number of regex retries in iteration 392: 0
[2025-11-13 10:30:22,733][__main__][INFO] - agents played in iteration 392 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:30:23,197][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:30:23,230][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:30:23,263][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:30:23,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:30:23,297][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:30:23,298][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:30:24,023][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:30:24,321][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:30:24,648][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:30:24,976][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:30:25,302][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:30:25,628][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:30:25,956][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:30:26,282][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:30:26,608][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:30:26,934][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:30:27,261][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:30:27,587][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:30:27,912][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:30:28,238][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:30:28,570][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:30:28,896][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:30:29,222][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:30:29,548][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:30:29,873][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:30:30,200][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:30:30,526][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:30:30,851][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:30:31,178][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:30:31,506][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:30:31,834][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:30:32,167][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:30:32,491][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:30:32,817][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:30:33,146][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:30:33,474][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:30:33,801][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:30:34,130][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:30:34,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:30:35,169][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:30:35,896][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:30:35,897][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:30:35,899][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:30:36,812][__main__][INFO] - Iteration 393 took 23s (40.48% Gen, 55.65% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 16m 6s. Estimated total time: 19h 42m 53s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 25s, 500 more iterations: 3h 17m 8s.
[2025-11-13 10:30:36,814][__main__][INFO] - Starting iteration 393.
[2025-11-13 10:30:36,818][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1.
[2025-11-13 10:30:36,819][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:30:46,195][__main__][INFO] - Number of regex retries in iteration 393: 0
[2025-11-13 10:30:46,195][__main__][INFO] - agents played in iteration 393 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:30:46,653][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:30:46,686][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:30:46,720][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:30:46,754][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:30:46,754][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:30:46,755][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:30:47,491][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:30:47,788][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:30:48,114][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:30:48,441][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:30:48,769][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:30:49,095][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:30:49,421][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:30:49,748][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:30:50,074][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:30:50,400][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:30:50,727][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:30:51,053][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:30:51,385][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:30:51,716][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:30:52,041][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:30:52,368][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:30:52,695][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:30:53,021][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:30:53,346][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:30:53,675][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:30:54,000][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:30:54,327][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:30:54,656][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:30:54,987][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:30:55,317][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:30:55,648][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:30:55,977][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:30:56,306][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:30:56,631][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:30:56,956][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:30:57,282][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:30:57,610][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:30:57,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:30:58,644][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:30:59,370][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:30:59,371][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:30:59,373][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:31:00,289][__main__][INFO] - Iteration 394 took 23s (39.95% Gen, 56.14% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 6m 25s. Estimated total time: 19h 33m 36s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 7s, 500 more iterations: 3h 15m 36s.
[2025-11-13 10:31:00,291][__main__][INFO] - Starting iteration 394.
[2025-11-13 10:31:00,294][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1.
[2025-11-13 10:31:00,294][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:31:09,698][__main__][INFO] - Number of regex retries in iteration 394: 0
[2025-11-13 10:31:09,699][__main__][INFO] - agents played in iteration 394 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:31:10,153][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:31:10,186][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:31:10,220][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:31:10,254][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:31:10,254][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:31:10,254][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:31:10,981][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:31:11,279][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:31:11,607][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:31:11,933][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:31:12,258][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:31:12,587][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:31:12,915][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:31:13,240][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:31:13,567][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:31:13,893][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:31:14,219][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:31:14,545][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:31:14,871][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:31:15,201][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:31:15,529][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:31:15,855][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:31:16,180][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:31:16,506][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:31:16,833][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:31:17,159][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:31:17,485][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:31:17,813][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:31:18,142][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:31:18,470][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:31:18,797][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:31:19,125][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:31:19,452][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:31:19,778][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:31:20,105][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:31:20,431][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:31:20,758][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:31:21,085][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:31:21,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:31:22,124][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:31:22,848][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:31:22,849][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:31:22,851][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:31:23,748][__main__][INFO] - Iteration 395 took 23s (40.09% Gen, 56.08% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 5m 9s. Estimated total time: 19h 32m 43s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 5s, 500 more iterations: 3h 15m 27s.
[2025-11-13 10:31:23,749][__main__][INFO] - Starting iteration 395.
[2025-11-13 10:31:23,753][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1.
[2025-11-13 10:31:23,754][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:31:33,018][__main__][INFO] - Number of regex retries in iteration 395: 0
[2025-11-13 10:31:33,019][__main__][INFO] - agents played in iteration 395 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:31:33,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:31:33,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:31:33,556][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:31:33,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:31:33,590][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:31:33,591][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:31:34,320][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:31:34,617][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:31:34,942][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:31:35,269][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:31:35,594][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:31:35,920][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:31:36,245][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:31:36,573][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:31:36,899][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:31:37,225][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:31:37,552][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:31:37,878][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:31:38,203][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:31:38,530][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:31:38,859][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:31:39,187][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:31:39,515][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:31:39,842][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:31:40,170][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:31:40,496][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:31:40,823][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:31:41,150][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:31:41,477][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:31:41,803][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:31:42,130][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:31:42,457][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:31:42,783][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:31:43,110][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:31:43,436][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:31:43,762][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:31:44,088][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:31:44,416][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:31:44,744][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:31:45,456][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:31:46,171][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:31:46,173][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:31:46,175][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:31:47,105][__main__][INFO] - Iteration 396 took 23s (39.67% Gen, 56.33% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 59m 42s. Estimated total time: 19h 27m 40s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 55s, 500 more iterations: 3h 14m 36s.
[2025-11-13 10:31:47,107][__main__][INFO] - Starting iteration 396.
[2025-11-13 10:31:47,110][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1.
[2025-11-13 10:31:47,111][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:31:56,571][__main__][INFO] - Number of regex retries in iteration 396: 0
[2025-11-13 10:31:56,572][__main__][INFO] - agents played in iteration 396 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:31:57,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:31:57,079][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:31:57,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:31:57,146][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:31:57,146][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:31:57,146][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:31:57,897][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:31:58,192][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:31:58,519][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:31:58,844][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:31:59,171][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:31:59,498][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:31:59,824][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:32:00,150][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:32:00,478][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:32:00,803][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:32:01,131][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:32:01,459][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:32:01,786][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:32:02,112][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:32:02,438][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:32:02,766][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:32:03,093][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:32:03,420][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:32:03,750][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:32:04,082][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:32:04,413][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:32:04,739][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:32:05,068][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:32:05,396][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:32:05,723][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:32:06,050][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:32:06,378][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:32:06,710][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:32:07,038][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:32:07,365][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:32:07,690][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:32:08,021][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:32:08,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:32:09,064][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:32:09,782][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:32:09,783][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:32:09,785][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:32:10,858][__main__][INFO] - Iteration 397 took 23s (39.83% Gen, 55.64% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 19m 3s. Estimated total time: 19h 47m 25s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 34s, 500 more iterations: 3h 17m 54s.
[2025-11-13 10:32:10,860][__main__][INFO] - Starting iteration 397.
[2025-11-13 10:32:10,863][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1.
[2025-11-13 10:32:10,864][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:32:20,360][__main__][INFO] - Number of regex retries in iteration 397: 0
[2025-11-13 10:32:20,361][__main__][INFO] - agents played in iteration 397 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:32:20,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:32:20,852][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:32:20,885][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:32:20,919][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:32:20,920][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:32:20,920][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:32:21,660][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:32:21,957][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:32:22,284][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:32:22,612][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:32:22,938][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:32:23,263][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:32:23,589][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:32:23,915][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:32:24,241][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:32:24,570][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:32:24,895][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:32:25,221][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:32:25,548][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:32:25,874][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:32:26,200][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:32:26,527][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:32:26,853][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:32:27,183][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:32:27,512][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:32:27,838][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:32:28,170][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:32:28,496][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:32:28,824][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:32:29,154][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:32:29,480][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:32:29,812][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:32:30,139][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:32:30,470][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:32:30,800][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:32:31,132][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:32:31,458][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:32:31,784][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:32:32,113][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:32:32,819][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:32:33,543][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:32:33,544][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:32:33,545][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:32:34,546][__main__][INFO] - Iteration 398 took 23s (40.10% Gen, 55.67% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 15m 26s. Estimated total time: 19h 44m 11s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 28s, 500 more iterations: 3h 17m 21s.
[2025-11-13 10:32:34,549][__main__][INFO] - Starting iteration 398.
[2025-11-13 10:32:34,554][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1.
[2025-11-13 10:32:34,555][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:32:43,770][__main__][INFO] - Number of regex retries in iteration 398: 0
[2025-11-13 10:32:43,771][__main__][INFO] - agents played in iteration 398 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:32:44,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:32:44,266][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:32:44,301][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:32:44,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:32:44,336][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:32:44,336][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:32:45,073][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:32:45,370][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:32:45,698][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:32:46,024][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:32:46,350][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:32:46,678][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:32:47,005][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:32:47,331][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:32:47,657][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:32:47,984][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:32:48,309][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:32:48,635][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:32:48,961][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:32:49,286][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:32:49,613][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:32:49,941][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:32:50,269][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:32:50,596][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:32:50,922][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:32:51,248][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:32:51,575][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:32:51,901][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:32:52,229][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:32:52,554][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:32:52,879][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:32:53,206][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:32:53,533][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:32:53,859][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:32:54,186][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:32:54,512][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:32:54,838][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:32:55,169][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:32:55,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:32:56,201][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:32:56,934][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:32:56,936][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:32:56,937][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:32:57,999][__main__][INFO] - Iteration 399 took 23s (39.31% Gen, 56.15% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 3m 13s. Estimated total time: 19h 32m 21s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 4s, 500 more iterations: 3h 15m 23s.
[2025-11-13 10:32:58,001][__main__][INFO] - Starting iteration 399.
[2025-11-13 10:32:58,004][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1.
[2025-11-13 10:32:58,005][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:33:06,831][__main__][INFO] - Number of regex retries in iteration 399: 0
[2025-11-13 10:33:06,832][__main__][INFO] - agents played in iteration 399 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:33:07,293][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:07,673][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:07,707][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:07,741][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:07,741][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:33:07,742][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:33:08,482][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:33:08,780][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:33:09,108][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:33:09,434][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:33:09,762][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:33:10,090][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:33:10,419][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:33:10,745][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:33:11,072][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:33:11,398][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:33:11,725][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:33:12,053][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:33:12,381][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:33:12,709][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:33:13,035][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:33:13,361][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:33:13,687][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:33:14,015][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:33:14,348][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:33:14,681][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:33:15,013][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:33:15,344][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:33:15,671][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:33:16,002][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:33:16,334][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:33:16,659][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:33:16,986][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:33:17,314][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:33:17,642][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:33:17,969][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:33:18,296][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:33:18,624][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:33:18,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:33:19,672][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:33:20,430][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:33:20,432][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:33:20,433][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:33:21,438][__main__][INFO] - Iteration 400 took 23s (37.67% Gen, 58.04% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 2m 11s. Estimated total time: 19h 31m 43s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 3s, 500 more iterations: 3h 15m 17s.
[2025-11-13 10:33:21,440][__main__][INFO] - Starting iteration 400.
[2025-11-13 10:33:21,444][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1.
[2025-11-13 10:33:21,444][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:33:30,236][__main__][INFO] - Number of regex retries in iteration 400: 0
[2025-11-13 10:33:30,237][__main__][INFO] - agents played in iteration 400 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:33:30,687][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:30,724][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:30,757][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:30,790][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:30,791][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:33:30,791][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:33:31,531][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:33:31,829][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:33:32,158][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:33:32,486][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:33:32,813][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:33:33,141][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:33:33,468][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:33:33,797][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:33:34,124][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:33:34,451][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:33:34,779][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:33:35,106][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:33:35,431][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:33:35,759][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:33:36,088][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:33:36,419][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:33:36,747][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:33:37,074][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:33:37,400][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:33:37,728][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:33:38,054][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:33:38,385][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:33:38,714][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:33:39,041][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:33:39,368][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:33:39,697][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:33:40,026][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:33:40,353][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:33:40,683][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:33:41,011][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:33:41,339][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:33:41,665][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:33:41,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:33:42,702][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:33:43,425][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:33:43,427][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:33:43,429][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:33:45,315][__main__][INFO] - Iteration 401 took 23s (36.83% Gen, 55.26% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 23m 39s. Estimated total time: 19h 53m 35s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 47s, 500 more iterations: 3h 18m 55s.
[2025-11-13 10:33:45,317][__main__][INFO] - Starting iteration 401.
[2025-11-13 10:33:45,320][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:33:45,321][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:33:54,798][__main__][INFO] - Number of regex retries in iteration 401: 0 [2025-11-13 10:33:54,799][__main__][INFO] - agents played in iteration 401 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:33:55,256][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:33:55,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:33:55,322][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:33:55,356][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:33:55,357][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:33:55,358][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:33:56,111][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:33:56,408][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:33:56,735][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:33:57,062][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:33:57,389][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:33:57,717][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:33:58,045][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:33:58,377][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:33:58,703][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:33:59,029][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:33:59,356][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:33:59,682][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:34:00,009][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:34:00,336][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:34:00,664][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:34:00,991][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:34:01,316][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:34:01,647][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:34:01,979][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:34:02,310][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:34:02,643][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:34:02,968][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:34:03,296][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:34:03,627][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:34:03,955][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:34:04,283][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:34:04,609][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:34:04,935][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:34:05,263][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:34:05,591][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:34:05,917][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:34:06,246][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:34:06,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:34:07,276][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:34:08,007][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:34:08,009][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:34:08,010][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:34:08,956][__main__][INFO] - Iteration 402 took 23s (40.10% Gen, 55.89% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 11m 31s. Estimated total time: 19h 41m 51s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 23s, 500 more iterations: 3h 16m 58s.
[2025-11-13 10:34:08,959][__main__][INFO] - Starting iteration 402.
[2025-11-13 10:34:08,962][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:34:08,963][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:34:17,591][__main__][INFO] - Number of regex retries in iteration 402: 0
[2025-11-13 10:34:17,592][__main__][INFO] - agents played in iteration 402 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:34:18,049][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:18,409][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:18,444][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:18,478][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:18,479][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:34:18,479][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:34:19,229][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:34:19,528][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:34:19,856][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:34:20,184][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:34:20,512][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:34:20,839][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:34:21,168][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:34:21,494][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:34:21,822][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:34:22,148][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:34:22,474][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:34:22,801][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:34:23,127][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:34:23,453][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:34:23,779][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:34:24,105][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:34:24,431][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:34:24,762][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:34:25,091][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:34:25,415][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:34:25,743][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:34:26,070][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:34:26,400][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:34:26,729][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:34:27,060][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:34:27,388][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:34:27,716][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:34:28,045][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:34:28,374][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:34:28,703][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:34:29,029][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:34:29,357][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:34:29,683][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:34:30,364][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:34:31,097][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:34:31,098][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:34:31,100][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:34:32,049][__main__][INFO] - Iteration 403 took 23s (37.37% Gen, 58.51% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 43m 41s. Estimated total time: 19h 14m 24s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 28s, 500 more iterations: 3h 12m 24s.
[2025-11-13 10:34:32,052][__main__][INFO] - Starting iteration 403.
[2025-11-13 10:34:32,055][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:34:32,056][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:34:41,799][__main__][INFO] - Number of regex retries in iteration 403: 0
[2025-11-13 10:34:41,800][__main__][INFO] - agents played in iteration 403 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:34:42,272][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:42,307][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:42,341][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:42,375][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:42,376][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:34:42,377][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:34:43,118][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:34:43,415][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:34:43,742][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:34:44,071][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:34:44,397][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:34:44,724][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:34:45,051][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:34:45,378][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:34:45,704][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:34:46,031][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:34:46,357][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:34:46,684][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:34:47,011][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:34:47,337][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:34:47,665][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:34:47,992][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:34:48,320][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:34:48,649][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:34:48,977][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:34:49,302][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:34:49,634][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:34:49,961][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:34:50,288][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:34:50,614][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:34:50,942][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:34:51,269][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:34:51,600][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:34:51,932][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:34:52,258][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:34:52,585][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:34:52,913][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:34:53,240][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:34:53,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:34:54,250][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:34:54,997][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:34:54,999][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:34:55,001][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:34:56,002][__main__][INFO] - Iteration 404 took 23s (40.69% Gen, 55.12% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 26m 18s. Estimated total time: 19h 57m 25s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 54s, 500 more iterations: 3h 19m 34s.
[2025-11-13 10:34:56,004][__main__][INFO] - Starting iteration 404.
[2025-11-13 10:34:56,008][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:34:56,008][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:35:05,640][__main__][INFO] - Number of regex retries in iteration 404: 0
[2025-11-13 10:35:05,641][__main__][INFO] - agents played in iteration 404 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:35:06,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:06,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:06,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:06,205][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:06,206][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:35:06,206][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:35:06,949][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:35:07,247][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:35:07,575][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:35:07,901][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:35:08,229][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:35:08,554][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:35:08,881][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:35:09,208][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:35:09,537][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:35:09,863][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:35:10,190][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:35:10,516][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:35:10,842][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:35:11,171][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:35:11,496][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:35:11,823][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:35:12,150][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:35:12,477][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:35:12,804][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:35:13,132][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:35:13,458][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:35:13,785][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:35:14,117][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:35:14,444][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:35:14,771][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:35:15,099][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:35:15,426][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:35:15,754][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:35:16,082][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:35:16,409][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:35:16,734][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:35:17,060][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:35:17,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:35:18,089][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:35:18,820][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:35:18,821][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:35:18,823][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:35:19,781][__main__][INFO] - Iteration 405 took 23s (40.52% Gen, 55.45% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 17m 11s. Estimated total time: 19h 48m 42s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 37s, 500 more iterations: 3h 18m 7s.
[2025-11-13 10:35:19,783][__main__][INFO] - Starting iteration 405.
[2025-11-13 10:35:19,786][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:35:19,787][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:35:29,150][__main__][INFO] - Number of regex retries in iteration 405: 0
[2025-11-13 10:35:29,150][__main__][INFO] - agents played in iteration 405 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:35:29,608][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:29,641][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:29,674][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:29,708][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:29,708][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:35:29,709][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:35:30,449][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:35:30,746][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:35:31,072][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:35:31,398][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:35:31,725][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:35:32,052][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:35:32,381][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:35:32,707][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:35:33,033][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:35:33,362][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:35:33,690][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:35:34,016][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:35:34,343][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:35:34,670][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:35:35,000][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:35:35,328][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:35:35,656][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:35:35,988][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:35:36,315][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:35:36,646][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:35:36,974][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:35:37,303][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:35:37,633][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:35:37,965][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:35:38,294][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:35:38,621][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:35:38,953][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:35:39,280][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:35:39,607][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:35:39,934][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:35:40,261][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:35:40,587][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:35:40,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:35:41,642][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:35:42,374][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:35:42,375][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:35:42,377][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:35:43,383][__main__][INFO] - Iteration 406 took 23s (39.68% Gen, 56.05% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 7m 57s. Estimated total time: 19h 39m 51s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 19s, 500 more iterations: 3h 16m 38s.
[2025-11-13 10:35:43,385][__main__][INFO] - Starting iteration 406.
[2025-11-13 10:35:43,388][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:35:43,388][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:35:52,526][__main__][INFO] - Number of regex retries in iteration 406: 0
[2025-11-13 10:35:52,526][__main__][INFO] - agents played in iteration 406 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:35:52,979][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:53,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:53,046][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:53,080][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:53,081][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:35:53,081][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:35:53,826][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:35:54,122][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:35:54,448][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:35:54,776][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:35:55,103][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:35:55,430][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:35:55,757][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:35:56,082][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:35:56,410][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:35:56,737][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:35:57,063][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:35:57,389][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:35:57,718][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:35:58,046][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:35:58,373][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:35:58,702][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:35:59,028][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:35:59,354][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:35:59,681][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:36:00,009][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:36:00,337][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:36:00,662][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:36:00,990][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:36:01,316][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:36:01,643][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:36:01,974][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:36:02,307][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:36:02,638][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:36:02,969][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:36:03,294][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:36:03,625][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:36:03,950][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:36:04,278][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:36:05,002][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:36:05,721][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:36:05,722][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:36:05,724][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:36:06,682][__main__][INFO] - Iteration 407 took 23s (39.23% Gen, 56.65% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 52m 28s. Estimated total time: 19h 24m 45s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 49s, 500 more iterations: 3h 14m 7s.
[2025-11-13 10:36:06,686][__main__][INFO] - Starting iteration 407.
[2025-11-13 10:36:06,689][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:36:06,690][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:36:15,680][__main__][INFO] - Number of regex retries in iteration 407: 0
[2025-11-13 10:36:15,681][__main__][INFO] - agents played in iteration 407 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:36:16,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:16,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:16,206][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:16,239][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:16,240][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:36:16,240][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:36:16,990][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:36:17,289][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:36:17,615][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:36:17,942][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:36:18,268][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:36:18,596][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:36:18,922][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:36:19,249][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:36:19,577][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:36:19,904][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:36:20,230][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:36:20,557][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:36:20,883][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:36:21,212][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:36:21,543][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:36:21,871][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:36:22,198][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:36:22,528][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:36:22,856][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:36:23,184][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:36:23,510][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:36:23,836][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:36:24,164][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:36:24,493][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:36:24,820][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:36:25,149][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:36:25,479][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:36:25,807][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:36:26,135][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:36:26,464][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:36:26,793][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:36:27,122][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:36:27,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:36:28,168][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:36:28,897][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:36:28,898][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:36:28,900][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:36:29,887][__main__][INFO] - Iteration 408 took 23s (38.75% Gen, 56.98% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 47m 15s. Estimated total time: 19h 19m 56s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 39s, 500 more iterations: 3h 13m 19s.
[2025-11-13 10:36:29,890][__main__][INFO] - Starting iteration 408.
[2025-11-13 10:36:29,893][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:36:29,894][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:36:39,401][__main__][INFO] - Number of regex retries in iteration 408: 0
[2025-11-13 10:36:39,402][__main__][INFO] - agents played in iteration 408 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:36:39,874][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:39,907][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:39,941][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:39,974][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:39,975][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:36:39,975][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:36:40,718][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:36:41,014][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:36:41,340][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:36:41,668][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:36:41,993][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:36:42,320][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:36:42,647][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:36:42,973][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:36:43,300][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:36:43,627][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:36:43,954][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:36:44,279][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:36:44,606][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:36:44,933][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:36:45,261][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:36:45,589][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:36:45,915][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:36:46,242][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:36:46,573][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:36:46,901][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:36:47,232][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:36:47,562][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:36:47,888][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:36:48,217][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:36:48,548][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:36:48,878][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:36:49,210][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:36:49,538][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:36:49,864][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:36:50,190][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:36:50,517][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:36:50,843][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:36:51,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:36:51,902][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:36:52,645][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:36:52,646][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:36:52,649][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:36:53,627][__main__][INFO] - Iteration 409 took 23s (40.06% Gen, 55.81% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 13m 41s. Estimated total time: 19h 46m 45s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 33s, 500 more iterations: 3h 17m 47s.
[2025-11-13 10:36:53,630][__main__][INFO] - Starting iteration 409.
[2025-11-13 10:36:53,633][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:36:53,633][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:37:02,609][__main__][INFO] - Number of regex retries in iteration 409: 0
[2025-11-13 10:37:02,609][__main__][INFO] - agents played in iteration 409 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:37:03,062][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:03,443][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:03,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:03,511][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:03,512][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:37:03,512][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:37:04,266][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:37:04,563][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:37:04,890][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:37:05,218][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:37:05,546][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:37:05,872][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:37:06,198][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:37:06,525][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:37:06,852][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:37:07,177][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:37:07,503][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:37:07,831][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:37:08,161][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:37:08,492][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:37:08,822][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:37:09,153][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:37:09,483][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:37:09,808][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:37:10,137][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:37:10,467][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:37:10,792][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:37:11,121][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:37:11,450][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:37:11,778][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:37:12,110][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:37:12,438][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:37:12,764][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:37:13,090][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:37:13,416][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:37:13,743][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:37:14,069][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:37:14,395][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:37:14,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:37:15,423][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:37:16,156][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:37:16,157][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:37:16,159][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:37:17,124][__main__][INFO] - Iteration 410 took 23s (38.21% Gen, 57.68% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 1m 6s. Estimated total time: 19h 34m 34s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 9s, 500 more iterations: 3h 15m 45s.
[2025-11-13 10:37:17,126][__main__][INFO] - Starting iteration 410.
[2025-11-13 10:37:17,129][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:37:17,129][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:37:26,195][__main__][INFO] - Number of regex retries in iteration 410: 0
[2025-11-13 10:37:26,196][__main__][INFO] - agents played in iteration 410 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:37:26,659][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:26,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:26,726][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:26,759][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:26,760][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:37:26,760][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:37:27,506][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:37:27,806][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:37:28,133][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:37:28,461][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:37:28,789][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:37:29,117][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:37:29,444][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:37:29,771][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:37:30,098][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:37:30,426][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:37:30,752][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:37:31,079][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:37:31,407][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:37:31,735][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:37:32,063][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:37:32,389][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:37:32,718][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:37:33,043][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:37:33,369][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:37:33,695][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:37:34,021][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:37:34,352][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:37:34,679][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:37:35,010][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:37:35,337][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:37:35,666][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:37:35,997][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:37:36,329][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:37:36,656][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:37:36,982][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:37:37,309][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:37:37,637][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:37:37,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:37:38,661][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:37:39,391][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:37:39,392][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:37:39,394][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:37:41,315][__main__][INFO] - Iteration 411 took 24s (37.49% Gen, 54.56% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 35m 29s. Estimated total time: 20h 9m 21s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 18s, 500 more iterations: 3h 21m 33s.
[2025-11-13 10:37:41,317][__main__][INFO] - Starting iteration 411.
[2025-11-13 10:37:41,321][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:37:41,322][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:37:50,811][__main__][INFO] - Number of regex retries in iteration 411: 0
[2025-11-13 10:37:50,812][__main__][INFO] - agents played in iteration 411 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:37:51,271][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:51,306][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:51,340][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:51,374][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:51,375][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:37:51,375][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:37:52,120][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:37:52,417][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:37:52,743][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:37:53,070][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:37:53,398][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:37:53,723][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:37:54,051][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:37:54,380][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:37:54,708][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:37:55,036][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:37:55,362][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:37:55,689][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:37:56,015][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:37:56,341][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:37:56,668][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:37:56,998][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:37:57,324][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:37:57,651][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:37:57,979][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:37:58,305][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:37:58,631][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:37:58,959][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:37:59,290][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:37:59,621][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:37:59,951][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:38:00,283][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:38:00,610][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:38:00,941][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:38:01,269][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:38:01,594][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:38:01,921][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:38:02,247][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:38:02,574][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:38:03,256][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:38:03,985][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:38:03,987][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:38:03,989][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:38:04,950][__main__][INFO] - Iteration 412 took 23s (40.16% Gen, 55.76% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 7m 16s. Estimated total time: 19h 41m 32s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 23s, 500 more iterations: 3h 16m 55s.
[2025-11-13 10:38:04,953][__main__][INFO] - Starting iteration 412.
[2025-11-13 10:38:04,955][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:38:04,956][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:38:14,673][__main__][INFO] - Number of regex retries in iteration 412: 0 [2025-11-13 10:38:14,673][__main__][INFO] - agents played in iteration 412 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:38:15,131][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:38:15,164][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:38:15,198][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:38:15,232][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:38:15,233][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:38:15,233][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:38:15,979][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:38:16,276][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:38:16,603][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:38:16,930][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:38:17,257][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:38:17,585][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:38:17,911][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:38:18,240][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:38:18,569][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:38:18,897][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:38:19,225][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:38:19,552][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:38:19,881][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:38:20,209][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:38:20,536][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:38:20,861][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:38:21,189][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:38:21,517][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:38:21,844][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:38:22,170][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:38:22,496][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:38:22,824][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:38:23,156][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:38:23,483][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:38:23,811][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:38:24,139][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:38:24,467][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:38:24,792][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:38:25,117][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:38:25,442][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:38:25,769][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:38:26,095][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:38:26,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:38:27,145][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:38:27,888][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:38:27,889][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:38:27,891][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:38:29,159][__main__][INFO] - Iteration 413 took 24s (40.15% Gen, 54.61% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 35m 33s. Estimated total time: 20h 10m 13s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 20s, 500 more iterations: 3h 21m 42s. [2025-11-13 10:38:29,162][__main__][INFO] - Starting iteration 413. [2025-11-13 10:38:29,166][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. 
[2025-11-13 10:38:29,166][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:38:38,729][__main__][INFO] - Number of regex retries in iteration 413: 0 [2025-11-13 10:38:38,730][__main__][INFO] - agents played in iteration 413 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:38:39,198][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:38:39,231][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:38:39,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:38:39,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:38:39,299][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:38:39,299][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:38:40,050][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:38:40,346][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:38:40,673][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:38:41,003][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:38:41,329][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:38:41,656][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:38:41,982][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:38:42,309][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:38:42,636][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:38:42,963][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:38:43,291][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:38:43,616][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:38:43,943][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:38:44,269][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:38:44,595][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:38:44,921][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:38:45,248][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:38:45,574][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:38:45,899][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:38:46,224][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:38:46,551][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:38:46,881][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:38:47,208][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:38:47,534][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:38:47,860][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:38:48,186][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:38:48,512][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:38:48,837][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:38:49,165][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:38:49,493][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:38:49,818][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:38:50,143][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:38:50,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:38:51,196][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:38:51,931][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:38:51,932][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:38:51,934][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:38:52,906][__main__][INFO] - Iteration 414 took 23s (40.28% Gen, 55.62% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 11m 59s. Estimated total time: 19h 47m 3s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 34s, 500 more iterations: 3h 17m 50s. [2025-11-13 10:38:52,908][__main__][INFO] - Starting iteration 414. [2025-11-13 10:38:52,911][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. 
[2025-11-13 10:38:52,912][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:39:02,223][__main__][INFO] - Number of regex retries in iteration 414: 0 [2025-11-13 10:39:02,223][__main__][INFO] - agents played in iteration 414 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:39:02,681][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:39:02,714][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:39:02,748][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:39:02,782][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:39:02,783][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:39:02,783][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:39:03,523][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:39:03,821][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:39:04,147][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:39:04,474][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:39:04,801][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:39:05,128][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:39:05,454][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:39:05,780][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:39:06,106][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:39:06,434][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:39:06,760][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:39:07,087][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:39:07,413][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:39:07,742][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:39:08,072][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:39:08,404][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:39:08,732][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:39:09,061][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:39:09,387][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:39:09,714][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:39:10,041][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:39:10,366][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:39:10,694][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:39:11,020][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:39:11,348][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:39:11,675][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:39:12,001][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:39:12,329][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:39:12,654][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:39:12,980][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:39:13,305][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:39:13,631][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:39:13,960][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:39:14,675][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:39:15,404][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:39:15,405][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:39:15,407][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:39:16,523][__main__][INFO] - Iteration 415 took 23s (39.44% Gen, 55.83% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 5m 10s. Estimated total time: 19h 40m 37s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 21s, 500 more iterations: 3h 16m 46s. [2025-11-13 10:39:16,525][__main__][INFO] - Starting iteration 415. [2025-11-13 10:39:16,528][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. 
[2025-11-13 10:39:16,528][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:39:26,090][__main__][INFO] - Number of regex retries in iteration 415: 0 [2025-11-13 10:39:26,091][__main__][INFO] - agents played in iteration 415 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:39:26,552][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:39:26,586][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:39:26,619][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:39:26,653][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:39:26,654][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:39:26,654][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:39:27,400][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:39:27,696][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:39:28,023][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:39:28,350][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:39:28,677][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:39:29,003][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:39:29,331][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:39:29,656][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:39:29,982][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:39:30,311][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:39:30,638][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:39:30,963][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:39:31,289][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:39:31,616][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:39:31,942][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:39:32,267][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:39:32,593][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:39:32,921][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:39:33,246][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:39:33,575][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:39:33,901][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:39:34,227][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:39:34,553][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:39:34,880][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:39:35,206][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:39:35,532][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:39:35,859][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:39:36,185][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:39:36,511][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:39:36,838][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:39:37,163][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:39:37,487][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:39:37,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:39:38,544][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:39:39,267][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:39:39,269][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:39:39,270][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:39:40,233][__main__][INFO] - Iteration 416 took 23s (40.34% Gen, 55.60% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 9m 26s. Estimated total time: 19h 45m 17s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 30s, 500 more iterations: 3h 17m 32s. [2025-11-13 10:39:40,235][__main__][INFO] - Starting iteration 416. [2025-11-13 10:39:40,238][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. 
[2025-11-13 10:39:40,239][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:39:49,000][__main__][INFO] - Number of regex retries in iteration 416: 0 [2025-11-13 10:39:49,001][__main__][INFO] - agents played in iteration 416 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:39:49,467][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:39:49,502][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:39:49,535][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:39:49,569][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:39:49,570][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:39:49,571][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:39:50,318][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:39:50,614][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:39:50,941][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:39:51,268][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:39:51,595][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:39:51,924][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:39:52,249][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:39:52,575][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:39:52,902][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:39:53,229][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:39:53,555][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:39:53,881][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:39:54,209][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:39:54,536][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:39:54,867][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:39:55,193][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:39:55,520][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:39:55,849][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:39:56,180][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:39:56,506][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:39:56,833][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:39:57,162][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:39:57,488][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:39:57,817][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:39:58,142][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:39:58,468][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:39:58,794][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:39:59,120][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:39:59,444][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:39:59,768][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:40:00,096][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:40:00,423][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:40:00,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:40:01,460][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:40:02,186][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:40:02,187][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:40:02,189][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:40:03,338][__main__][INFO] - Iteration 417 took 23s (37.93% Gen, 57.09% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 38m 48s. Estimated total time: 19h 15m 2s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 30s, 500 more iterations: 3h 12m 30s. [2025-11-13 10:40:03,341][__main__][INFO] - Starting iteration 417. [2025-11-13 10:40:03,345][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. 
[2025-11-13 10:40:03,345][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:40:12,867][__main__][INFO] - Number of regex retries in iteration 417: 0 [2025-11-13 10:40:12,867][__main__][INFO] - agents played in iteration 417 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:40:13,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:40:13,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:40:13,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:40:13,436][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:40:13,436][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:40:13,437][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:40:14,185][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:40:14,483][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:40:14,810][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:40:15,137][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:40:15,463][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:40:15,789][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:40:16,117][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:40:16,444][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:40:16,774][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:40:17,100][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:40:17,426][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:40:17,753][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:40:18,079][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:40:18,411][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:40:18,740][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:40:19,070][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:40:19,399][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:40:19,726][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:40:20,053][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:40:20,383][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:40:20,712][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:40:21,037][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:40:21,365][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:40:21,693][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:40:22,021][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:40:22,347][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:40:22,676][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:40:23,001][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:40:23,330][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:40:23,655][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:40:23,982][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:40:24,308][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:40:24,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:40:25,363][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:40:26,104][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:40:26,106][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:40:26,108][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:40:27,150][__main__][INFO] - Iteration 418 took 23s (40.00% Gen, 55.62% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 13m 42s. Estimated total time: 19h 50m 20s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 40s, 500 more iterations: 3h 18m 23s.
[2025-11-13 10:40:27,152][__main__][INFO] - Starting iteration 418.
[2025-11-13 10:40:27,155][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:40:27,156][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:40:36,948][__main__][INFO] - Number of regex retries in iteration 418: 0
[2025-11-13 10:40:36,948][__main__][INFO] - agents played in iteration 418 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:40:37,418][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:40:37,451][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:40:37,486][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:40:37,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:40:37,520][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:40:37,520][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:40:38,274][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:40:38,569][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:40:38,897][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:40:39,223][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:40:39,550][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:40:39,876][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:40:40,205][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:40:40,531][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:40:40,859][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:40:41,185][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:40:41,511][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:40:41,836][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:40:42,162][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:40:42,489][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:40:42,814][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:40:43,140][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:40:43,469][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:40:43,794][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:40:44,122][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:40:44,452][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:40:44,781][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:40:45,107][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:40:45,432][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:40:45,760][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:40:46,091][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:40:46,417][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:40:46,746][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:40:47,073][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:40:47,398][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:40:47,724][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:40:48,052][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:40:48,379][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:40:48,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:40:49,420][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:40:50,152][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:40:50,153][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:40:50,155][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:40:51,284][__main__][INFO] - Iteration 419 took 24s (40.59% Gen, 54.73% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 29m 27s. Estimated total time: 20h 6m 29s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 12s, 500 more iterations: 3h 21m 4s.
[2025-11-13 10:40:51,287][__main__][INFO] - Starting iteration 419.
[2025-11-13 10:40:51,290][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:40:51,290][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:41:00,274][__main__][INFO] - Number of regex retries in iteration 419: 0
[2025-11-13 10:41:00,274][__main__][INFO] - agents played in iteration 419 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:41:00,731][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:00,764][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:00,797][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:00,831][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:00,831][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:41:00,832][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:41:01,592][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:41:01,890][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:41:02,216][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:41:02,542][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:41:02,868][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:41:03,194][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:41:03,520][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:41:03,846][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:41:04,173][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:41:04,498][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:41:04,824][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:41:05,150][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:41:05,477][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:41:05,803][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:41:06,131][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:41:06,458][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:41:06,785][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:41:07,111][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:41:07,439][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:41:07,766][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:41:08,093][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:41:08,417][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:41:08,750][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:41:09,076][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:41:09,404][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:41:09,731][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:41:10,062][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:41:10,389][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:41:10,717][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:41:11,044][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:41:11,371][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:41:11,699][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:41:12,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:41:12,742][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:41:13,468][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:41:13,470][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:41:13,472][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:41:14,438][__main__][INFO] - Iteration 420 took 23s (38.81% Gen, 57.01% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 40m 1s. Estimated total time: 19h 17m 26s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 34s, 500 more iterations: 3h 12m 54s.
[2025-11-13 10:41:14,440][__main__][INFO] - Starting iteration 420.
[2025-11-13 10:41:14,443][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:41:14,444][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:41:23,667][__main__][INFO] - Number of regex retries in iteration 420: 0
[2025-11-13 10:41:23,667][__main__][INFO] - agents played in iteration 420 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:41:24,123][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:24,156][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:24,189][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:24,223][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:24,224][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:41:24,224][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:41:24,982][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:41:25,280][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:41:25,608][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:41:25,934][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:41:26,262][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:41:26,589][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:41:26,915][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:41:27,241][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:41:27,567][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:41:27,893][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:41:28,219][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:41:28,545][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:41:28,875][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:41:29,204][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:41:29,532][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:41:29,863][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:41:30,192][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:41:30,520][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:41:30,846][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:41:31,175][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:41:31,504][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:41:31,832][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:41:32,160][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:41:32,492][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:41:32,819][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:41:33,146][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:41:33,478][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:41:33,809][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:41:34,137][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:41:34,464][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:41:34,793][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:41:35,120][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:41:35,447][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:41:36,202][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:41:36,933][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:41:37,084][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:41:37,086][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:41:39,019][__main__][INFO] - Iteration 421 took 24s (37.53% Gen, 54.60% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 51m 0s. Estimated total time: 20h 28m 50s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 57s, 500 more iterations: 3h 24m 48s.
[2025-11-13 10:41:39,021][__main__][INFO] - Starting iteration 421.
[2025-11-13 10:41:39,025][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:41:39,025][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:41:48,466][__main__][INFO] - Number of regex retries in iteration 421: 0
[2025-11-13 10:41:48,466][__main__][INFO] - agents played in iteration 421 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:41:48,930][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:48,963][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:48,997][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:49,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:49,031][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:41:49,032][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:41:49,771][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:41:50,067][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:41:50,396][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:41:50,722][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:41:51,049][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:41:51,375][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:41:51,702][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:41:52,029][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:41:52,355][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:41:52,682][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:41:53,009][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:41:53,337][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:41:53,665][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:41:53,993][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:41:54,324][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:41:54,653][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:41:54,980][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:41:55,312][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:41:55,637][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:41:55,961][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:41:56,289][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:41:56,615][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:41:56,947][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:41:57,274][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:41:57,604][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:41:57,930][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:41:58,256][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:41:58,583][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:41:58,911][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:41:59,236][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:41:59,564][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:41:59,892][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:42:00,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:42:00,939][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:42:01,687][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:42:01,689][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:42:01,691][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:42:02,640][__main__][INFO] - Iteration 422 took 23s (39.98% Gen, 56.00% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 2m 34s. Estimated total time: 19h 40m 48s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 21s, 500 more iterations: 3h 16m 48s.
[2025-11-13 10:42:02,642][__main__][INFO] - Starting iteration 422.
[2025-11-13 10:42:02,645][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:42:02,646][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:42:11,777][__main__][INFO] - Number of regex retries in iteration 422: 0
[2025-11-13 10:42:11,778][__main__][INFO] - agents played in iteration 422 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:42:12,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:12,269][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:12,303][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:12,337][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:12,338][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:42:12,338][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:42:13,103][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:42:13,401][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:42:13,730][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:42:14,057][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:42:14,382][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:42:14,709][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:42:15,035][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:42:15,360][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:42:15,686][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:42:16,012][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:42:16,339][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:42:16,665][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:42:16,993][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:42:17,318][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:42:17,644][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:42:17,971][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:42:18,298][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:42:18,625][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:42:18,950][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:42:19,276][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:42:19,601][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:42:19,929][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:42:20,256][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:42:20,582][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:42:20,910][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:42:21,238][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:42:21,567][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:42:21,899][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:42:22,225][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:42:22,553][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:42:22,880][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:42:23,206][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:42:23,532][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:42:24,233][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:42:24,958][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:42:24,960][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:42:24,962][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:42:25,906][__main__][INFO] - Iteration 423 took 23s (39.26% Gen, 56.68% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 44m 28s. Estimated total time: 19h 23m 5s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 46s, 500 more iterations: 3h 13m 50s.
[2025-11-13 10:42:25,908][__main__][INFO] - Starting iteration 423.
[2025-11-13 10:42:25,911][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:42:25,912][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:42:35,151][__main__][INFO] - Number of regex retries in iteration 423: 0
[2025-11-13 10:42:35,152][__main__][INFO] - agents played in iteration 423 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:42:35,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:35,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:35,682][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:35,716][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:35,717][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:42:35,717][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:42:36,478][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:42:36,775][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:42:37,103][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:42:37,430][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:42:37,758][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:42:38,084][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:42:38,410][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:42:38,736][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:42:39,062][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:42:39,388][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:42:39,714][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:42:40,043][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:42:40,370][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:42:40,696][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:42:41,023][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:42:41,349][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:42:41,677][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:42:42,003][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:42:42,331][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:42:42,659][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:42:42,990][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:42:43,320][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:42:43,647][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:42:43,974][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:42:44,301][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:42:44,630][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:42:44,958][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:42:45,290][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:42:45,615][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:42:45,940][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:42:46,268][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:42:46,593][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:42:46,918][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:42:47,597][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:42:48,314][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:42:48,315][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:42:48,317][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:42:49,319][__main__][INFO] - Iteration 424 took 23s (39.47% Gen, 56.24% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 51m 26s. Estimated total time: 19h 30m 26s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 0s, 500 more iterations: 3h 15m 4s.
[2025-11-13 10:42:49,321][__main__][INFO] - Starting iteration 424.
[2025-11-13 10:42:49,325][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:42:49,325][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:42:58,472][__main__][INFO] - Number of regex retries in iteration 424: 0
[2025-11-13 10:42:58,472][__main__][INFO] - agents played in iteration 424 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:42:58,924][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:58,957][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:58,991][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:59,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:59,025][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:42:59,025][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:42:59,765][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:43:00,064][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:43:00,391][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:43:00,718][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:43:01,044][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:43:01,371][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:43:01,698][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:43:02,025][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:43:02,352][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:43:02,680][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:43:03,006][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:43:03,332][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:43:03,660][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:43:03,989][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:43:04,315][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:43:04,642][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:43:04,968][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:43:05,296][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:43:05,623][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:43:05,950][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:43:06,277][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:43:06,603][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:43:06,932][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:43:07,258][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:43:07,590][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:43:07,919][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:43:08,249][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:43:08,575][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:43:08,901][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:43:09,229][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:43:09,555][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:43:09,882][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:43:10,209][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:43:10,894][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:43:11,637][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:43:11,638][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:43:11,640][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:43:12,607][__main__][INFO] - Iteration 425 took 23s (39.29% Gen, 56.56% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 44m 47s. Estimated total time: 19h 24m 10s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 48s, 500 more iterations: 3h 14m 1s.
[2025-11-13 10:43:12,609][__main__][INFO] - Starting iteration 425.
[2025-11-13 10:43:12,612][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:43:12,613][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:43:22,333][__main__][INFO] - Number of regex retries in iteration 425: 0
[2025-11-13 10:43:22,334][__main__][INFO] - agents played in iteration 425 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:43:22,791][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:22,825][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:22,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:22,892][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:22,892][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:43:22,893][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:43:23,635][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:43:23,932][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:43:24,262][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:43:24,589][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:43:24,915][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:43:25,242][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:43:25,568][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:43:25,894][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:43:26,219][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:43:26,546][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:43:26,873][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:43:27,201][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:43:27,528][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:43:27,860][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:43:28,189][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:43:28,515][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:43:28,842][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:43:29,168][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:43:29,495][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:43:29,823][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:43:30,148][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:43:30,474][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:43:30,803][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:43:31,130][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:43:31,462][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:43:31,795][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:43:32,123][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:43:32,450][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:43:32,781][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:43:33,108][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:43:33,438][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:43:33,765][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:43:34,094][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:43:34,782][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:43:35,507][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:43:35,508][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:43:35,510][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:43:36,478][__main__][INFO] - Iteration 426 took 23s (40.73% Gen, 55.20% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 13m 33s. Estimated total time: 19h 53m 21s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 46s, 500 more iterations: 3h 18m 53s.
[2025-11-13 10:43:36,481][__main__][INFO] - Starting iteration 426.
[2025-11-13 10:43:36,484][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:43:36,485][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:43:46,112][__main__][INFO] - Number of regex retries in iteration 426: 0
[2025-11-13 10:43:46,113][__main__][INFO] - agents played in iteration 426 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:43:46,573][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:46,606][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:46,640][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:46,674][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:46,675][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:43:46,675][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:43:47,426][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:43:47,723][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:43:48,051][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:43:48,377][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:43:48,703][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:43:49,030][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:43:49,356][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:43:49,683][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:43:50,010][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:43:50,338][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:43:50,665][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:43:50,997][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:43:51,323][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:43:51,651][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:43:51,977][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:43:52,307][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:43:52,634][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:43:52,961][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:43:53,288][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:43:53,619][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:43:53,948][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:43:54,276][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:43:54,604][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:43:54,931][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:43:55,258][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:43:55,586][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:43:55,917][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:43:56,245][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:43:56,573][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:43:56,901][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:43:57,230][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:43:57,556][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:43:57,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:43:58,610][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:43:59,341][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:43:59,343][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:43:59,344][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:44:00,286][__main__][INFO] - Iteration 427 took 23s (40.44% Gen, 55.59% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 9m 58s. Estimated total time: 19h 50m 9s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 40s, 500 more iterations: 3h 18m 21s.
[2025-11-13 10:44:00,288][__main__][INFO] - Starting iteration 427.
[2025-11-13 10:44:00,291][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:44:00,292][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:44:09,187][__main__][INFO] - Number of regex retries in iteration 427: 0
[2025-11-13 10:44:09,188][__main__][INFO] - agents played in iteration 427 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:44:09,649][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:09,683][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:09,716][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:09,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:09,751][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:44:09,751][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:44:10,861][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:44:11,159][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:44:11,486][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:44:11,812][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:44:12,139][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:44:12,467][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:44:12,794][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:44:13,120][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:44:13,446][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:44:13,774][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:44:14,100][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:44:14,428][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:44:14,756][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:44:15,088][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:44:15,421][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:44:15,749][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:44:16,077][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:44:16,406][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:44:16,734][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:44:17,063][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:44:17,390][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:44:17,717][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:44:18,045][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:44:18,373][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:44:18,702][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:44:19,033][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:44:19,360][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:44:19,688][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:44:20,015][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:44:20,343][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:44:20,670][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:44:20,996][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:44:21,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:44:22,073][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:44:22,822][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:44:22,824][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:44:22,826][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:44:23,804][__main__][INFO] - Iteration 428 took 23s (37.83% Gen, 58.00% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 55m 6s. Estimated total time: 19h 35m 40s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 11s, 500 more iterations: 3h 15m 56s.
[2025-11-13 10:44:23,806][__main__][INFO] - Starting iteration 428.
[2025-11-13 10:44:23,810][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:44:23,810][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:44:33,608][__main__][INFO] - Number of regex retries in iteration 428: 0
[2025-11-13 10:44:33,608][__main__][INFO] - agents played in iteration 428 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:44:34,064][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:34,098][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:34,132][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:34,167][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:34,167][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:44:34,168][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:44:34,921][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:44:35,217][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:44:35,544][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:44:35,872][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:44:36,199][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:44:36,525][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:44:36,852][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:44:37,179][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:44:37,507][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:44:37,833][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:44:38,164][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:44:38,493][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:44:38,820][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:44:39,150][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:44:39,479][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:44:39,811][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:44:40,139][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:44:40,470][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:44:40,798][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:44:41,126][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:44:41,455][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:44:41,782][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:44:42,110][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:44:42,441][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:44:42,768][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:44:43,095][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:44:43,421][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:44:43,749][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:44:44,078][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:44:44,406][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:44:44,732][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:44:45,058][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:44:45,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:44:46,109][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:44:46,836][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:44:46,838][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:44:46,839][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:44:47,794][__main__][INFO] - Iteration 429 took 23s (40.85% Gen, 55.16% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 18m 16s. Estimated total time: 19h 59m 15s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 58s, 500 more iterations: 3h 19m 52s.
[2025-11-13 10:44:47,796][__main__][INFO] - Starting iteration 429.
[2025-11-13 10:44:47,798][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
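(Editor's note, not part of the log.) The block above records one full training step: 128 mini-batches accumulated into a single policy-gradient loss over 3840 tokens (30 tokens per mini-batch), followed by one "Apply reinforce step" optimizer update. A minimal toy sketch of that gradient-accumulation pattern, assuming nothing about the actual mllm trainer code; `apply_reinforce_step` and the linear `policy` stand-in are hypothetical:

```python
import torch

NUM_MINI_BATCHES = 128
TOKENS_PER_BATCH = 30  # 128 * 30 = 3840 tokens, matching the log

torch.manual_seed(0)
policy = torch.nn.Linear(8, 4)  # toy stand-in for the LLM adapter
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def apply_reinforce_step(mini_batches):
    """Accumulate the REINFORCE loss over every mini-batch, then take
    a single optimizer step (the logged "Apply reinforce step" task)."""
    optimizer.zero_grad()
    total_tokens = 0
    for i, (obs, actions, advantages) in enumerate(mini_batches):
        if i % 4 == 0:  # the log reports progress every 4 mini-batches
            print(f"Processing mini-batch {i} of {len(mini_batches)}")
        logps = torch.log_softmax(policy(obs), dim=-1)
        chosen = logps.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
        loss = -(chosen * advantages).mean()
        (loss / len(mini_batches)).backward()  # accumulate; step later
        total_tokens += obs.shape[0]
    print(f"Accumulated the policy gradient loss for {total_tokens} tokens.")
    optimizer.step()  # one update for the whole accumulated batch
    return total_tokens

batches = [
    (torch.randn(TOKENS_PER_BATCH, 8),
     torch.randint(0, 4, (TOKENS_PER_BATCH,)),
     torch.randn(TOKENS_PER_BATCH))
    for _ in range(NUM_MINI_BATCHES)
]
apply_reinforce_step(batches)
```

Scaling each mini-batch loss by `1/len(mini_batches)` makes the accumulated gradient the average over the batch, equivalent to one large-batch update.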
[2025-11-13 10:44:47,799][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:44:57,406][__main__][INFO] - Number of regex retries in iteration 429: 0
[2025-11-13 10:44:57,406][__main__][INFO] - agents played in iteration 429 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:44:57,876][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:57,910][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:57,943][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:57,977][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:57,977][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:44:57,977][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:44:58,727][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:44:59,024][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:44:59,349][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:44:59,676][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:45:00,004][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:45:00,330][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:45:00,657][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:45:00,985][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:45:01,310][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:45:01,638][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:45:01,965][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:45:02,293][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:45:02,619][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:45:02,946][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:45:03,272][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:45:03,599][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:45:03,927][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:45:04,259][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:45:04,588][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:45:04,915][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:45:05,241][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:45:05,571][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:45:05,898][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:45:06,228][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:45:06,556][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:45:06,883][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:45:07,211][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:45:07,543][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:45:07,875][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:45:08,203][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:45:08,529][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:45:08,854][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:45:09,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:45:09,891][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:45:10,609][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:45:10,610][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:45:10,612][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:45:11,578][__main__][INFO] - Iteration 430 took 23s (40.40% Gen, 55.53% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 7m 40s. Estimated total time: 19h 49m 3s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 38s, 500 more iterations: 3h 18m 10s.
[2025-11-13 10:45:11,580][__main__][INFO] - Starting iteration 430.
[2025-11-13 10:45:11,584][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:45:11,584][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:45:20,931][__main__][INFO] - Number of regex retries in iteration 430: 0
[2025-11-13 10:45:20,932][__main__][INFO] - agents played in iteration 430 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:45:21,389][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:21,423][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:21,456][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:21,490][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:21,490][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:45:21,491][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:45:22,234][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:45:22,531][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:45:22,858][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:45:23,185][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:45:23,512][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:45:23,839][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:45:24,168][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:45:24,495][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:45:24,821][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:45:25,148][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:45:25,475][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:45:25,805][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:45:26,134][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:45:26,465][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:45:26,793][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:45:27,121][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:45:27,451][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:45:27,784][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:45:28,112][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:45:28,441][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:45:28,771][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:45:29,105][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:45:29,438][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:45:29,766][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:45:30,093][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:45:30,420][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:45:30,747][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:45:31,072][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:45:31,399][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:45:31,725][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:45:32,054][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:45:32,381][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:45:32,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:45:33,393][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:45:34,133][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:45:34,135][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:45:34,137][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:45:36,005][__main__][INFO] - Iteration 431 took 24s (38.27% Gen, 54.07% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 39m 19s. Estimated total time: 20h 21m 6s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 42s, 500 more iterations: 3h 23m 31s.
[2025-11-13 10:45:36,008][__main__][INFO] - Starting iteration 431.
[2025-11-13 10:45:36,011][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:45:36,012][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:45:45,265][__main__][INFO] - Number of regex retries in iteration 431: 0
[2025-11-13 10:45:45,265][__main__][INFO] - agents played in iteration 431 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:45:45,719][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:45,752][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:45,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:45,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:45,821][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:45:45,822][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:45:46,571][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:45:46,868][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:45:47,195][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:45:47,522][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:45:47,851][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:45:48,179][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:45:48,507][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:45:48,834][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:45:49,160][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:45:49,489][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:45:49,814][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:45:50,141][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:45:50,469][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:45:50,797][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:45:51,124][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:45:51,450][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:45:51,777][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:45:52,105][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:45:52,433][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:45:52,760][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:45:53,092][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:45:53,419][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:45:53,747][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:45:54,074][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:45:54,401][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:45:54,728][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:45:55,055][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:45:55,383][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:45:55,709][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:45:56,036][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:45:56,364][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:45:56,692][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:45:57,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:45:57,704][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:45:58,435][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:45:58,437][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:45:58,439][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:45:59,573][__main__][INFO] - Iteration 432 took 23s (39.27% Gen, 55.91% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 55m 57s. Estimated total time: 19h 38m 7s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 16s, 500 more iterations: 3h 16m 21s.
[2025-11-13 10:45:59,575][__main__][INFO] - Starting iteration 432.
[2025-11-13 10:45:59,578][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:45:59,579][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:46:08,962][__main__][INFO] - Number of regex retries in iteration 432: 0
[2025-11-13 10:46:08,962][__main__][INFO] - agents played in iteration 432 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:46:09,426][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:09,459][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:09,492][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:09,527][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:09,527][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:46:09,528][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:46:10,276][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:46:10,573][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:46:10,900][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:46:11,227][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:46:11,555][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:46:11,882][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:46:12,210][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:46:12,537][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:46:12,863][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:46:13,190][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:46:13,517][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:46:13,842][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:46:14,171][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:46:14,500][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:46:14,828][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:46:15,159][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:46:15,485][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:46:15,812][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:46:16,140][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:46:16,473][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:46:16,801][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:46:17,129][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:46:17,459][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:46:17,785][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:46:18,111][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:46:18,437][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:46:18,763][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:46:19,091][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:46:19,417][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:46:19,743][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:46:20,070][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:46:20,396][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:46:20,722][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:46:21,467][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:46:22,189][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:46:22,190][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:46:22,192][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:46:23,213][__main__][INFO] - Iteration 433 took 23s (39.70% Gen, 55.97% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 59m 12s. Estimated total time: 19h 41m 46s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 23s, 500 more iterations: 3h 16m 57s.
[2025-11-13 10:46:23,215][__main__][INFO] - Starting iteration 433.
[2025-11-13 10:46:23,218][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:46:23,219][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:46:32,690][__main__][INFO] - Number of regex retries in iteration 433: 0
[2025-11-13 10:46:32,690][__main__][INFO] - agents played in iteration 433 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:46:33,148][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:33,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:33,215][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:33,249][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:33,249][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:46:33,251][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:46:34,011][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:46:34,308][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:46:34,635][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:46:34,961][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:46:35,289][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:46:35,616][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:46:35,944][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:46:36,271][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:46:36,598][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:46:36,926][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:46:37,254][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:46:37,582][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:46:37,908][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:46:38,235][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:46:38,562][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:46:38,890][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:46:39,216][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:46:39,548][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:46:39,881][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:46:40,210][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:46:40,538][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:46:40,870][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:46:41,200][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:46:41,532][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:46:41,861][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:46:42,190][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:46:42,518][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:46:42,848][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:46:43,179][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:46:43,510][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:46:43,838][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:46:44,164][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:46:44,491][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:46:45,205][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:46:45,937][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:46:45,938][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:46:45,940][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:46:46,918][__main__][INFO] - Iteration 434 took 23s (39.96% Gen, 55.90% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 2m 5s. Estimated total time: 19h 45m 3s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 30s, 500 more iterations: 3h 17m 30s.
[2025-11-13 10:46:46,920][__main__][INFO] - Starting iteration 434.
[2025-11-13 10:46:46,924][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:46:46,925][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:46:55,701][__main__][INFO] - Number of regex retries in iteration 434: 0 [2025-11-13 10:46:55,701][__main__][INFO] - agents played in iteration 434 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:46:56,157][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:46:56,191][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:46:56,225][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:46:56,259][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:46:56,259][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:46:56,260][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
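The "For task: …, ΔVRAM % (total) …, Block Peak % of device VRAM …" entries look like the output of a context manager that snapshots memory before and after a block of work. A generic sketch with an injectable `probe` callable — the real trainer presumably reads CUDA allocator statistics (e.g. `torch.cuda.memory_allocated`), but `probe`, `log_vram`, and all other names here are assumptions for illustration:

```python
from contextlib import contextmanager

@contextmanager
def log_vram(task, probe, total_bytes, log=print):
    """Report the VRAM delta and block peak for a task as % of total_bytes.

    `probe()` returns (current_bytes, peak_bytes). On a GPU this could wrap
    the CUDA allocator's current/peak counters (an assumption, not verified).
    """
    before, _ = probe()          # snapshot before the block runs
    yield
    after, peak = probe()        # snapshot after the block finishes
    log(f"For task: {task}, "
        f"ΔVRAM % (total): {100 * (after - before) / total_bytes:.2f}%, "
        f"Current % of VRAM taken: {100 * after / total_bytes:.2f}%, "
        f"Block Peak % of device VRAM: {100 * peak / total_bytes:.2f}%")
```

A 0.00% delta with a nonzero block peak, as in the advantage-computation entries above, then just means the block allocated and freed memory transiently.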
[2025-11-13 10:46:57,018][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:46:57,315][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:46:57,643][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:46:57,971][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:46:58,297][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:46:58,624][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:46:58,951][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:46:59,278][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:46:59,605][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:46:59,935][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:47:00,268][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:47:00,596][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:47:00,922][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:47:01,250][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:47:01,576][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:47:01,903][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:47:02,229][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:47:02,556][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:47:02,883][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:47:03,211][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:47:03,538][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:47:03,865][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:47:04,194][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:47:04,522][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:47:04,849][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:47:05,176][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:47:05,506][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:47:05,835][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:47:06,164][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:47:06,495][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:47:06,821][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:47:07,148][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:47:07,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:47:08,171][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:47:08,893][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:47:08,894][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:47:08,896][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:47:09,892][__main__][INFO] - Iteration 435 took 22s (38.21% Gen, 57.45% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 25m 4s. Estimated total time: 19h 8m 24s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 16s, 500 more iterations: 3h 11m 24s. [2025-11-13 10:47:09,894][__main__][INFO] - Starting iteration 435. [2025-11-13 10:47:09,897][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. 
[2025-11-13 10:47:09,898][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:47:19,291][__main__][INFO] - Number of regex retries in iteration 435: 0 [2025-11-13 10:47:19,292][__main__][INFO] - agents played in iteration 435 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:47:19,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:47:19,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:47:19,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:47:19,854][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:47:19,855][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:47:19,856][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:47:20,596][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:47:20,893][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:47:21,219][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:47:21,546][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:47:21,873][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:47:22,200][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:47:22,529][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:47:22,856][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:47:23,184][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:47:23,513][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:47:23,841][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:47:24,169][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:47:24,497][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:47:24,824][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:47:25,150][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:47:25,477][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:47:25,806][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:47:26,132][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:47:26,459][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:47:26,789][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:47:27,117][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:47:27,448][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:47:27,779][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:47:28,110][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:47:28,440][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:47:28,768][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:47:29,096][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:47:29,423][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:47:29,748][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:47:30,075][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:47:30,403][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:47:30,730][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:47:31,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:47:31,794][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:47:32,514][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:47:32,515][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:47:32,517][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:47:33,553][__main__][INFO] - Iteration 436 took 23s (39.71% Gen, 55.90% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 59m 7s. Estimated total time: 19h 42m 51s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 25s, 500 more iterations: 3h 17m 8s. [2025-11-13 10:47:33,555][__main__][INFO] - Starting iteration 436. [2025-11-13 10:47:33,559][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. 
[2025-11-13 10:47:33,559][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:47:43,093][__main__][INFO] - Number of regex retries in iteration 436: 0 [2025-11-13 10:47:43,094][__main__][INFO] - agents played in iteration 436 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:47:43,561][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:47:43,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:47:43,629][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:47:43,663][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:47:43,663][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:47:43,663][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:47:44,403][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:47:44,699][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:47:45,027][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:47:45,353][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:47:45,680][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:47:46,006][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:47:46,333][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:47:46,662][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:47:46,987][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:47:47,317][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:47:47,644][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:47:47,971][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:47:48,301][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:47:48,632][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:47:48,964][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:47:49,298][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:47:49,630][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:47:49,958][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:47:50,290][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:47:50,618][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:47:50,951][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:47:51,278][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:47:51,606][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:47:51,935][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:47:52,261][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:47:52,588][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:47:52,916][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:47:53,244][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:47:53,571][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:47:53,897][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:47:54,226][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:47:54,553][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:47:54,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:47:55,611][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:47:56,336][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:47:56,337][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:47:56,339][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:47:57,297][__main__][INFO] - Iteration 437 took 23s (40.16% Gen, 55.79% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 2m 49s. Estimated total time: 19h 46m 57s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 33s, 500 more iterations: 3h 17m 49s. [2025-11-13 10:47:57,299][__main__][INFO] - Starting iteration 437. [2025-11-13 10:47:57,301][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. 
[2025-11-13 10:47:57,302][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:48:06,885][__main__][INFO] - Number of regex retries in iteration 437: 0 [2025-11-13 10:48:06,886][__main__][INFO] - agents played in iteration 437 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:48:07,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:48:07,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:48:07,414][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:48:07,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:48:07,448][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:48:07,449][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:48:08,176][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:48:08,475][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:48:08,801][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:48:09,129][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:48:09,457][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:48:09,786][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:48:10,113][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:48:10,439][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:48:10,768][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:48:11,096][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:48:11,425][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:48:11,753][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:48:12,081][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:48:12,410][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:48:12,739][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:48:13,066][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:48:13,395][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:48:13,723][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:48:14,051][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:48:14,377][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:48:14,704][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:48:15,036][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:48:15,367][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:48:15,693][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:48:16,020][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:48:16,345][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:48:16,671][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:48:16,998][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:48:17,329][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:48:17,654][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:48:17,981][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:48:18,310][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:48:18,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:48:19,360][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:48:20,082][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:48:20,084][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:48:20,085][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:48:21,034][__main__][INFO] - Iteration 438 took 23s (40.38% Gen, 55.62% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 2m 9s. Estimated total time: 19h 46m 41s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 33s, 500 more iterations: 3h 17m 46s. [2025-11-13 10:48:21,037][__main__][INFO] - Starting iteration 438. [2025-11-13 10:48:21,039][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. 
[2025-11-13 10:48:21,040][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:48:29,783][__main__][INFO] - Number of regex retries in iteration 438: 0 [2025-11-13 10:48:29,784][__main__][INFO] - agents played in iteration 438 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:48:30,250][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:48:30,284][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:48:30,317][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:48:30,351][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:48:30,352][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:48:30,352][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:48:31,106][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:48:31,403][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:48:31,731][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:48:32,058][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:48:32,388][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:48:32,715][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:48:33,042][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:48:33,369][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:48:33,696][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:48:34,028][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:48:34,361][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:48:34,691][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:48:35,019][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:48:35,345][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:48:35,671][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:48:35,998][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:48:36,325][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:48:36,658][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:48:36,987][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:48:37,314][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:48:37,644][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:48:37,972][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:48:38,301][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:48:38,628][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:48:38,955][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:48:39,280][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:48:39,607][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:48:39,933][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:48:40,261][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:48:40,589][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:48:40,916][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:48:41,242][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:48:41,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:48:42,305][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:48:43,031][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:48:43,034][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:48:43,036][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:48:44,006][__main__][INFO] - Iteration 439 took 22s (38.07% Gen, 57.70% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 23m 27s. Estimated total time: 19h 8m 22s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 16s, 500 more iterations: 3h 11m 23s. [2025-11-13 10:48:44,008][__main__][INFO] - Starting iteration 439. [2025-11-13 10:48:44,011][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. 
[2025-11-13 10:48:44,012][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:48:52,453][__main__][INFO] - Number of regex retries in iteration 439: 0 [2025-11-13 10:48:52,454][__main__][INFO] - agents played in iteration 439 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:48:52,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:48:52,951][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:48:52,985][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:48:53,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:48:53,021][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:48:53,022][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:48:53,771][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:48:54,070][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:48:54,398][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:48:54,723][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:48:55,050][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:48:55,376][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:48:55,702][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:48:56,030][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:48:56,358][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:48:56,684][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:48:57,010][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:48:57,337][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:48:57,664][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:48:57,991][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:48:58,319][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:48:58,647][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:48:58,976][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:48:59,302][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:48:59,630][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:48:59,955][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:49:00,283][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:49:00,616][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:49:00,948][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:49:01,278][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:49:01,606][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:49:01,935][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:49:02,263][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:49:02,595][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:49:02,921][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:49:03,249][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:49:03,576][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:49:03,902][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:49:04,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
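Each training step above walks 128 mini-batches (logged every 4), accumulates the policy-gradient loss, and only then applies a single "reinforce step". A minimal pure-Python sketch of that accumulate-then-step pattern; the function and its signature are illustrative stand-ins, not the trainer's actual API:

```python
def accumulated_step(params, minibatches, grad_fn, lr=1e-4):
    """Accumulate gradients over all mini-batches, then apply one update.

    params:      list of floats (stand-in for model parameters)
    minibatches: sequence of data chunks
    grad_fn:     maps (params, batch) -> per-parameter gradient list
    """
    accum = [0.0 for _ in params]
    for batch in minibatches:
        g = grad_fn(params, batch)
        accum = [a + gi for a, gi in zip(accum, g)]
    # Average over mini-batches so the effective step size does not
    # depend on how finely the batch was split.
    n = len(minibatches)
    return [p - lr * a / n for p, a in zip(params, accum)]
```

The point of the pattern is that splitting a 128-sample batch into mini-batches bounds peak activation memory (consistent with the flat "Block Peak % of device VRAM" readings in the log) while producing the same averaged gradient as one large batch.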
[2025-11-13 10:49:04,927][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:49:05,654][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:49:05,655][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:49:05,657][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:49:06,650][__main__][INFO] - Iteration 440 took 22s (37.29% Gen, 58.32% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 6m 40s. Estimated total time: 18h 51m 57s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 43s, 500 more iterations: 3h 8m 39s.
[2025-11-13 10:49:06,652][__main__][INFO] - Starting iteration 440.
[2025-11-13 10:49:06,655][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
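The "Estimated remaining time" figures in the per-iteration summary are consistent with simple extrapolation from the mean iteration duration so far. A sketch of that bookkeeping, with hypothetical helper names rather than the script's actual code:

```python
def eta_seconds(elapsed_iters, elapsed_seconds, total_iters):
    """Extrapolate remaining wall-clock time from the mean iteration time."""
    mean = elapsed_seconds / elapsed_iters
    return mean * (total_iters - elapsed_iters)

def fmt_hms(seconds):
    """Format a duration in the log's 'Xh Ym Zs' style."""
    s = int(seconds)
    h, s = divmod(s, 3600)
    m, s = divmod(s, 60)
    return f"{h}h {m}m {s}s"
```

The total iteration count is not stated in this log excerpt, so `total_iters` here is an input the caller must supply; "N more iterations" estimates are just `mean * N` formatted the same way.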
[2025-11-13 10:49:06,656][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:49:15,470][__main__][INFO] - Number of regex retries in iteration 440: 0
[2025-11-13 10:49:15,470][__main__][INFO] - agents played in iteration 440 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:49:15,924][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:15,957][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:15,990][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:16,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:16,025][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:49:16,025][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:49:16,770][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:49:17,068][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:49:17,395][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:49:17,721][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:49:18,048][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:49:18,376][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:49:18,705][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:49:19,031][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:49:19,358][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:49:19,684][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:49:20,011][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:49:20,338][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:49:20,667][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:49:20,995][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:49:21,327][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:49:21,654][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:49:21,982][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:49:22,311][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:49:22,638][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:49:22,965][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:49:23,293][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:49:23,621][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:49:23,950][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:49:24,278][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:49:24,605][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:49:24,932][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:49:25,260][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:49:25,588][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:49:25,916][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:49:26,244][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:49:26,574][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:49:26,906][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:49:27,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:49:27,918][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:49:28,659][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:49:28,804][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:49:28,805][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:49:30,735][__main__][INFO] - Iteration 441 took 24s (36.60% Gen, 55.38% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 18m 21s. Estimated total time: 20h 4m 2s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 8s, 500 more iterations: 3h 20m 40s.
[2025-11-13 10:49:30,737][__main__][INFO] - Starting iteration 441.
[2025-11-13 10:49:30,740][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:49:30,741][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:49:39,769][__main__][INFO] - Number of regex retries in iteration 441: 0
[2025-11-13 10:49:39,770][__main__][INFO] - agents played in iteration 441 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:49:40,230][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:40,263][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:40,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:40,331][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:40,332][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:49:40,332][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:49:41,080][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:49:41,378][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:49:41,704][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:49:42,032][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:49:42,358][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:49:42,684][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:49:43,012][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:49:43,337][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:49:43,661][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:49:43,989][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:49:44,315][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:49:44,641][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:49:44,969][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:49:45,295][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:49:45,622][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:49:45,949][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:49:46,275][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:49:46,601][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:49:46,929][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:49:47,259][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:49:47,587][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:49:47,915][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:49:48,241][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:49:48,568][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:49:48,898][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:49:49,230][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:49:49,558][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:49:49,887][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:49:50,216][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:49:50,544][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:49:50,876][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:49:51,203][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:49:51,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:49:52,208][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:49:52,936][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:49:52,938][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:49:52,940][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:49:54,120][__main__][INFO] - Iteration 442 took 23s (38.62% Gen, 56.33% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 42m 56s. Estimated total time: 19h 29m 0s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 58s, 500 more iterations: 3h 14m 50s.
[2025-11-13 10:49:54,121][__main__][INFO] - Starting iteration 442.
[2025-11-13 10:49:54,125][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:49:54,125][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:50:03,853][__main__][INFO] - Number of regex retries in iteration 442: 0
[2025-11-13 10:50:03,855][__main__][INFO] - agents played in iteration 442 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:50:04,314][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:04,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:04,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:04,414][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:04,415][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:50:04,415][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:50:05,173][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:50:05,469][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:50:05,796][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:50:06,123][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:50:06,452][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:50:06,780][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:50:07,106][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:50:07,433][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:50:07,759][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:50:08,086][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:50:08,414][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:50:08,742][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:50:09,069][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:50:09,395][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:50:09,721][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:50:10,047][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:50:10,372][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:50:10,701][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:50:11,031][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:50:11,360][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:50:11,687][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:50:12,019][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:50:12,345][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:50:12,673][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:50:13,001][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:50:13,328][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:50:13,655][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:50:13,981][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:50:14,311][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:50:14,644][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:50:14,970][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:50:15,295][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:50:15,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:50:16,345][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:50:17,083][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:50:17,085][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:50:17,086][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:50:18,049][__main__][INFO] - Iteration 443 took 23s (40.67% Gen, 55.30% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 9m 46s. Estimated total time: 19h 56m 15s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 52s, 500 more iterations: 3h 19m 22s.
[2025-11-13 10:50:18,051][__main__][INFO] - Starting iteration 443.
[2025-11-13 10:50:18,054][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:50:18,055][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:50:27,694][__main__][INFO] - Number of regex retries in iteration 443: 0
[2025-11-13 10:50:27,695][__main__][INFO] - agents played in iteration 443 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:50:28,157][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:28,190][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:28,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:28,258][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:28,258][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:50:28,259][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:50:29,007][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:50:29,305][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:50:29,632][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:50:29,960][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:50:30,288][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:50:30,615][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:50:30,943][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:50:31,269][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:50:31,597][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:50:31,924][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:50:32,250][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:50:32,578][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:50:32,904][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:50:33,234][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:50:33,561][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:50:33,888][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:50:34,214][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:50:34,541][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:50:34,869][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:50:35,196][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:50:35,524][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:50:35,852][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:50:36,180][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:50:36,512][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:50:36,843][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:50:37,170][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:50:37,500][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:50:37,827][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:50:38,160][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:50:38,487][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:50:38,813][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:50:39,140][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:50:39,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:50:40,220][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:50:40,964][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:50:40,965][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:50:40,967][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:50:41,925][__main__][INFO] - Iteration 444 took 23s (40.38% Gen, 55.60% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 6m 41s. Estimated total time: 19h 53m 34s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 47s, 500 more iterations: 3h 18m 55s.
[2025-11-13 10:50:41,927][__main__][INFO] - Starting iteration 444.
[2025-11-13 10:50:41,931][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:50:41,931][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:50:50,996][__main__][INFO] - Number of regex retries in iteration 444: 0
[2025-11-13 10:50:50,997][__main__][INFO] - agents played in iteration 444 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:50:51,459][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:51,492][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:51,526][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:51,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:51,560][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:50:51,560][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:50:52,315][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:50:52,613][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:50:52,941][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:50:53,269][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:50:53,597][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:50:53,923][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:50:54,249][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:50:54,576][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:50:54,901][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:50:55,228][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:50:55,553][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:50:55,879][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:50:56,204][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:50:56,529][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:50:56,856][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:50:57,185][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:50:57,513][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:50:57,839][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:50:58,165][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:50:58,491][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:50:58,816][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:50:59,143][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:50:59,469][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:50:59,796][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:51:00,122][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:51:00,455][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:51:00,784][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:51:01,113][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:51:01,442][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:51:01,769][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:51:02,095][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:51:02,421][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:51:02,746][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:51:03,475][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:51:04,198][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:51:04,199][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:51:04,201][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:51:05,218][__main__][INFO] - Iteration 445 took 23s (38.93% Gen, 56.70% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 37m 9s. Estimated total time: 19h 24m 25s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 48s, 500 more iterations: 3h 14m 4s.
[2025-11-13 10:51:05,220][__main__][INFO] - Starting iteration 445.
[2025-11-13 10:51:05,223][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:51:05,224][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:51:14,291][__main__][INFO] - Number of regex retries in iteration 445: 0
[2025-11-13 10:51:14,292][__main__][INFO] - agents played in iteration 445 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:51:14,764][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:14,798][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:14,832][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:14,865][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:14,866][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:51:14,866][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:51:15,616][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:51:15,914][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:51:16,241][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:51:16,570][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:51:16,897][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:51:17,224][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:51:17,550][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:51:17,877][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:51:18,203][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:51:18,529][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:51:18,855][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:51:19,181][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:51:19,507][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:51:19,834][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:51:20,160][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:51:20,486][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:51:20,815][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:51:21,144][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:51:21,472][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:51:21,799][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:51:22,126][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:51:22,453][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:51:22,781][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:51:23,109][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:51:23,435][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:51:23,765][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:51:24,092][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:51:24,421][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:51:24,753][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:51:25,084][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:51:25,410][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:51:25,737][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:51:26,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:51:26,744][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:51:27,470][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:51:27,472][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:51:27,474][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:51:28,481][__main__][INFO] - Iteration 446 took 23s (38.99% Gen, 56.67% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 35m 15s. Estimated total time: 19h 22m 55s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 45s, 500 more iterations: 3h 13m 49s.
[2025-11-13 10:51:28,483][__main__][INFO] - Starting iteration 446.
[2025-11-13 10:51:28,486][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:51:28,487][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:51:37,842][__main__][INFO] - Number of regex retries in iteration 446: 0
[2025-11-13 10:51:37,843][__main__][INFO] - agents played in iteration 446 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:51:38,302][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:38,336][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:38,370][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:38,404][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:38,405][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:51:38,406][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:51:39,174][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:51:39,474][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:51:39,804][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:51:40,131][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:51:40,457][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:51:40,783][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:51:41,112][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:51:41,440][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:51:41,766][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:51:42,093][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:51:42,420][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:51:42,746][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:51:43,074][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:51:43,405][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:51:43,732][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:51:44,057][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:51:44,383][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:51:44,709][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:51:45,035][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:51:45,361][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:51:45,687][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:51:46,013][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:51:46,338][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:51:46,664][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:51:46,991][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:51:47,318][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:51:47,649][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:51:47,978][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:51:48,306][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:51:48,639][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:51:48,968][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:51:49,299][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:51:49,625][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:51:50,316][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:51:51,048][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:51:51,050][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:51:51,051][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:51:52,068][__main__][INFO] - Iteration 447 took 23s (39.68% Gen, 56.01% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 51m 4s. Estimated total time: 19h 39m 7s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 18s, 500 more iterations: 3h 16m 31s.
[2025-11-13 10:51:52,070][__main__][INFO] - Starting iteration 447.
[2025-11-13 10:51:52,073][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:51:52,073][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:52:00,771][__main__][INFO] - Number of regex retries in iteration 447: 0
[2025-11-13 10:52:00,772][__main__][INFO] - agents played in iteration 447 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:52:01,226][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:01,609][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:01,644][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:01,678][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:01,679][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:52:01,679][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:52:02,426][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:52:02,724][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:52:03,051][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:52:03,379][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:52:03,707][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:52:04,034][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:52:04,360][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:52:04,687][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:52:05,013][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:52:05,339][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:52:05,666][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:52:05,992][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:52:06,317][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:52:06,645][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:52:06,971][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:52:07,301][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:52:07,629][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:52:07,955][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:52:08,281][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:52:08,609][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:52:08,940][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:52:09,270][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:52:09,597][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:52:09,924][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:52:10,251][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:52:10,578][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:52:10,904][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:52:11,231][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:52:11,558][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:52:11,883][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:52:12,209][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:52:12,538][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:52:12,865][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:52:13,541][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:52:14,267][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:52:14,269][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:52:14,270][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:52:15,297][__main__][INFO] - Iteration 448 took 23s (37.45% Gen, 58.12% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 32m 48s. Estimated total time: 19h 21m 15s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 42s, 500 more iterations: 3h 13m 32s.
[2025-11-13 10:52:15,299][__main__][INFO] - Starting iteration 448.
[2025-11-13 10:52:15,303][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:52:15,303][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:52:25,124][__main__][INFO] - Number of regex retries in iteration 448: 0
[2025-11-13 10:52:25,124][__main__][INFO] - agents played in iteration 448 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:52:25,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:25,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:25,657][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:25,691][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:25,691][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:52:25,692][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:52:26,442][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:52:26,738][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:52:27,067][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:52:27,394][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:52:27,720][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:52:28,048][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:52:28,377][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:52:28,703][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:52:29,028][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:52:29,355][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:52:29,682][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:52:30,008][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:52:30,334][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:52:30,666][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:52:31,001][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:52:31,328][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:52:31,654][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:52:31,980][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:52:32,308][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:52:32,636][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:52:32,963][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:52:33,291][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:52:33,619][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:52:33,950][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:52:34,282][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:52:34,611][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:52:34,939][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:52:35,270][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:52:35,596][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:52:35,923][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:52:36,250][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:52:36,578][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:52:36,905][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:52:37,630][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:52:38,357][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:52:38,358][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:52:38,360][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:52:39,320][__main__][INFO] - Iteration 449 took 24s (40.89% Gen, 55.10% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 12m 4s. Estimated total time: 20h 0m 54s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 1s, 500 more iterations: 3h 20m 9s.
[2025-11-13 10:52:39,322][__main__][INFO] - Starting iteration 449.
[2025-11-13 10:52:39,325][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:52:39,326][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:52:48,437][__main__][INFO] - Number of regex retries in iteration 449: 0
[2025-11-13 10:52:48,438][__main__][INFO] - agents played in iteration 449 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:52:48,910][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:48,943][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:48,977][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:49,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:49,012][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:52:49,012][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:52:49,763][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:52:50,060][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:52:50,389][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:52:50,715][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:52:51,041][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:52:51,368][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:52:51,695][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:52:52,020][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:52:52,346][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:52:52,672][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:52:52,999][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:52:53,327][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:52:53,655][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:52:53,980][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:52:54,308][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:52:54,634][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:52:54,960][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:52:55,292][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:52:55,618][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:52:55,944][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:52:56,272][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:52:56,600][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:52:56,926][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:52:57,252][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:52:57,584][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:52:57,913][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:52:58,245][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:52:58,575][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:52:58,902][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:52:59,230][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:52:59,558][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:52:59,884][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:53:00,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:53:00,909][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:53:01,635][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:53:01,637][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:53:01,638][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:53:02,595][__main__][INFO] - Iteration 450 took 23s (39.16% Gen, 56.72% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 34m 19s. Estimated total time: 19h 23m 32s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 47s, 500 more iterations: 3h 13m 55s.
[2025-11-13 10:53:02,598][__main__][INFO] - Starting iteration 450.
[2025-11-13 10:53:02,601][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:53:02,602][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:53:12,021][__main__][INFO] - Number of regex retries in iteration 450: 0
[2025-11-13 10:53:12,021][__main__][INFO] - agents played in iteration 450 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:53:12,482][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:53:12,515][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:53:12,549][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:53:12,582][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:53:12,583][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:53:12,583][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:53:13,347][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:53:13,645][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:53:13,973][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:53:14,301][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:53:14,629][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:53:14,955][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:53:15,283][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:53:15,609][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:53:15,935][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:53:16,262][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:53:16,590][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:53:16,917][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:53:17,243][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:53:17,570][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:53:17,902][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:53:18,236][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:53:18,563][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:53:18,890][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:53:19,215][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:53:19,541][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:53:19,869][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:53:20,197][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:53:20,522][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:53:20,852][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:53:21,181][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:53:21,509][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:53:21,835][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:53:22,167][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:53:22,494][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:53:22,823][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:53:23,155][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:53:23,482][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:53:23,809][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:53:24,489][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:53:25,220][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:53:25,221][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:53:25,223][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:53:27,116][__main__][INFO] - Iteration 451 took 24s (38.43% Gen, 53.84% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 36m 8s. Estimated total time: 20h 25m 46s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 51s, 500 more iterations: 3h 24m 17s. [2025-11-13 10:53:27,118][__main__][INFO] - Starting iteration 451. [2025-11-13 10:53:27,122][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 10:53:27,122][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:53:36,242][__main__][INFO] - Number of regex retries in iteration 451: 0 [2025-11-13 10:53:36,242][__main__][INFO] - agents played in iteration 451 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:53:36,703][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:36,736][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:36,769][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:36,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:36,803][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:53:36,803][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:53:37,556][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:53:37,853][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:53:38,180][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:53:38,506][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:53:38,834][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:53:39,161][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:53:39,487][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:53:39,814][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:53:40,140][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:53:40,468][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:53:40,795][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:53:41,124][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:53:41,451][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:53:41,778][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:53:42,107][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:53:42,439][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:53:42,768][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:53:43,096][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:53:43,421][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:53:43,750][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:53:44,079][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:53:44,406][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:53:44,736][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:53:45,067][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:53:45,399][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:53:45,726][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:53:46,053][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:53:46,383][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:53:46,713][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:53:47,043][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:53:47,374][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:53:47,700][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:53:48,029][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:53:48,742][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:53:49,464][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:53:49,466][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:53:49,467][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:53:50,436][__main__][INFO] - Iteration 452 took 23s (39.11% Gen, 56.72% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 35m 43s. Estimated total time: 19h 25m 45s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 51s, 500 more iterations: 3h 14m 17s. [2025-11-13 10:53:50,438][__main__][INFO] - Starting iteration 452. [2025-11-13 10:53:50,441][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 10:53:50,442][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:53:59,443][__main__][INFO] - Number of regex retries in iteration 452: 0 [2025-11-13 10:53:59,444][__main__][INFO] - agents played in iteration 452 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:53:59,916][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:59,950][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:59,984][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:00,019][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:00,019][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:54:00,020][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:54:00,771][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:54:01,069][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:54:01,396][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:54:01,723][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:54:02,049][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:54:02,376][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:54:02,703][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:54:03,029][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:54:03,356][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:54:03,685][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:54:04,010][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:54:04,337][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:54:04,664][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:54:04,991][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:54:05,318][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:54:05,645][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:54:05,973][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:54:06,298][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:54:06,625][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:54:06,952][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:54:07,279][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:54:07,605][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:54:07,932][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:54:08,257][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:54:08,585][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:54:08,914][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:54:09,242][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:54:09,571][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:54:09,899][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:54:10,226][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:54:10,554][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:54:10,885][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:54:11,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:54:11,920][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:54:12,645][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:54:12,646][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:54:12,648][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:54:13,625][__main__][INFO] - Iteration 453 took 23s (38.83% Gen, 56.95% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 28m 51s. Estimated total time: 19h 19m 15s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 38s, 500 more iterations: 3h 13m 12s. [2025-11-13 10:54:13,627][__main__][INFO] - Starting iteration 453. [2025-11-13 10:54:13,631][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 10:54:13,632][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:54:22,922][__main__][INFO] - Number of regex retries in iteration 453: 0 [2025-11-13 10:54:22,922][__main__][INFO] - agents played in iteration 453 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:54:23,382][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:23,417][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:23,451][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:23,485][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:23,486][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:54:23,487][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:54:24,247][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:54:24,544][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:54:24,872][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:54:25,199][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:54:25,525][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:54:25,852][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:54:26,180][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:54:26,507][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:54:26,833][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:54:27,160][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:54:27,487][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:54:27,815][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:54:28,142][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:54:28,472][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:54:28,798][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:54:29,125][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:54:29,454][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:54:29,780][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:54:30,106][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:54:30,434][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:54:30,763][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:54:31,091][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:54:31,420][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:54:31,749][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:54:32,077][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:54:32,404][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:54:32,731][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:54:33,059][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:54:33,384][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:54:33,717][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:54:34,044][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:54:34,373][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:54:34,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:54:35,385][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:54:36,100][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:54:36,101][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:54:36,103][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:54:37,078][__main__][INFO] - Iteration 454 took 23s (39.62% Gen, 56.21% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 41m 37s. Estimated total time: 19h 32m 24s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 4s, 500 more iterations: 3h 15m 24s. [2025-11-13 10:54:37,081][__main__][INFO] - Starting iteration 454. [2025-11-13 10:54:37,084][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 10:54:37,085][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:54:46,625][__main__][INFO] - Number of regex retries in iteration 454: 0 [2025-11-13 10:54:46,626][__main__][INFO] - agents played in iteration 454 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:54:47,088][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:47,122][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:47,155][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:47,189][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:47,190][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:54:47,191][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:54:47,936][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:54:48,233][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:54:48,561][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:54:48,888][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:54:49,213][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:54:49,540][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:54:49,865][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:54:50,192][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:54:50,519][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:54:50,848][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:54:51,173][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:54:51,500][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:54:51,829][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:54:52,157][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:54:52,485][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:54:52,812][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:54:53,138][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:54:53,466][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:54:53,792][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:54:54,120][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:54:54,447][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:54:54,776][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:54:55,103][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:54:55,432][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:54:55,759][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:54:56,087][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:54:56,415][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:54:56,742][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:54:57,069][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:54:57,399][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:54:57,731][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:54:58,056][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:54:58,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:54:59,064][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:54:59,793][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:54:59,794][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:54:59,796][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:55:00,836][__main__][INFO] - Iteration 455 took 23s (40.17% Gen, 55.44% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 56m 26s. Estimated total time: 19h 47m 38s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 35s, 500 more iterations: 3h 17m 56s. [2025-11-13 10:55:00,838][__main__][INFO] - Starting iteration 455. [2025-11-13 10:55:00,842][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 10:55:00,842][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:55:10,529][__main__][INFO] - Number of regex retries in iteration 455: 0 [2025-11-13 10:55:10,529][__main__][INFO] - agents played in iteration 455 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:55:10,988][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:55:11,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:55:11,055][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:55:11,089][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:55:11,089][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:55:11,090][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:55:11,853][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:55:12,150][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:55:12,479][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:55:12,805][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:55:13,132][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:55:13,457][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:55:13,785][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:55:14,111][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:55:14,438][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:55:14,764][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:55:15,091][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:55:15,419][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:55:15,745][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:55:16,073][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:55:16,402][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:55:16,729][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:55:17,056][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:55:17,383][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:55:17,710][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:55:18,037][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:55:18,365][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128
[2025-11-13 10:55:18,692][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:55:19,017][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:55:19,343][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:55:19,671][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:55:19,999][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:55:20,326][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:55:20,653][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:55:20,981][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:55:21,311][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:55:21,640][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:55:21,966][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:55:22,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:55:22,979][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:55:23,717][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:55:23,719][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:55:23,721][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:55:24,709][__main__][INFO] - Iteration 456 took 23s (40.59% Gen, 55.27% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 1m 49s. Estimated total time: 19h 53m 24s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 46s, 500 more iterations: 3h 18m 54s.
[2025-11-13 10:55:24,711][__main__][INFO] - Starting iteration 456.
[2025-11-13 10:55:24,714][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1.
[2025-11-13 10:55:24,715][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:55:33,533][__main__][INFO] - Number of regex retries in iteration 456: 0
[2025-11-13 10:55:33,534][__main__][INFO] - agents played in iteration 456 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:55:33,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:34,026][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:34,059][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:34,094][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:34,094][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:55:34,094][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:55:34,851][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:55:35,149][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:55:35,478][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:55:35,807][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:55:36,134][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:55:36,461][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:55:36,788][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:55:37,114][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:55:37,441][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:55:37,769][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:55:38,096][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:55:38,422][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:55:38,750][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:55:39,081][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:55:39,413][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:55:39,745][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:55:40,074][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:55:40,402][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:55:40,728][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:55:41,056][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:55:41,384][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:55:41,711][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:55:42,036][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:55:42,368][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:55:42,697][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:55:43,024][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:55:43,354][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:55:43,681][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:55:44,007][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:55:44,340][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:55:44,668][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:55:44,994][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:55:45,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:55:46,019][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:55:46,750][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:55:46,751][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:55:46,753][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:55:47,706][__main__][INFO] - Iteration 457 took 22s (38.36% Gen, 57.49% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 17m 40s. Estimated total time: 19h 9m 39s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 19s, 500 more iterations: 3h 11m 36s.
[2025-11-13 10:55:47,708][__main__][INFO] - Starting iteration 457.
[2025-11-13 10:55:47,712][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1.
[2025-11-13 10:55:47,713][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:55:56,847][__main__][INFO] - Number of regex retries in iteration 457: 0
[2025-11-13 10:55:56,847][__main__][INFO] - agents played in iteration 457 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:55:57,317][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:57,351][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:57,384][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:57,418][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:57,418][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:55:57,419][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:55:58,186][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:55:58,483][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:55:58,809][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:55:59,135][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:55:59,461][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:55:59,787][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:56:00,114][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:56:00,442][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:56:00,769][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:56:01,095][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:56:01,421][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:56:01,748][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:56:02,075][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:56:02,401][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:56:02,730][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:56:03,057][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:56:03,389][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:56:03,718][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:56:04,044][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:56:04,370][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:56:04,697][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:56:05,025][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:56:05,350][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:56:05,679][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:56:06,006][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:56:06,333][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:56:06,662][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:56:06,992][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:56:07,324][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:56:07,653][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:56:07,980][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:56:08,308][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:56:08,639][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:56:09,369][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:56:10,105][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:56:10,107][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:56:10,108][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:56:11,137][__main__][INFO] - Iteration 458 took 23s (39.00% Gen, 56.60% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 38m 57s. Estimated total time: 19h 31m 19s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 2s, 500 more iterations: 3h 15m 13s.
[2025-11-13 10:56:11,139][__main__][INFO] - Starting iteration 458.
[2025-11-13 10:56:11,143][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1.
[2025-11-13 10:56:11,143][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:56:20,754][__main__][INFO] - Number of regex retries in iteration 458: 0
[2025-11-13 10:56:20,754][__main__][INFO] - agents played in iteration 458 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:56:21,218][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:21,251][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:21,285][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:21,319][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:21,320][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:56:21,321][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:56:22,073][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:56:22,370][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:56:22,698][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:56:23,027][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:56:23,354][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:56:23,681][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:56:24,008][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:56:24,334][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:56:24,661][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:56:24,989][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:56:25,316][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:56:25,642][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:56:25,969][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:56:26,295][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:56:26,620][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:56:26,946][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:56:27,274][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:56:27,600][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:56:27,928][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:56:28,255][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:56:28,583][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:56:28,913][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:56:29,241][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:56:29,568][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:56:29,897][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:56:30,227][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:56:30,553][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:56:30,885][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:56:31,214][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:56:31,545][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:56:31,870][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:56:32,196][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:56:32,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:56:33,253][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:56:33,987][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:56:33,988][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:56:33,990][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:56:34,948][__main__][INFO] - Iteration 459 took 23s (40.37% Gen, 55.60% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 57m 34s. Estimated total time: 19h 50m 20s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 40s, 500 more iterations: 3h 18m 23s.
[2025-11-13 10:56:34,950][__main__][INFO] - Starting iteration 459.
[2025-11-13 10:56:34,954][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1.
[2025-11-13 10:56:34,954][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:56:44,054][__main__][INFO] - Number of regex retries in iteration 459: 0
[2025-11-13 10:56:44,055][__main__][INFO] - agents played in iteration 459 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:56:44,513][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:44,546][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:44,580][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:44,614][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:44,614][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:56:44,615][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:56:45,366][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:56:45,664][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:56:45,992][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:56:46,319][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:56:46,647][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:56:46,974][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:56:47,302][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:56:47,628][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:56:47,955][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:56:48,281][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:56:48,608][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:56:48,940][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:56:49,267][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:56:49,593][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:56:49,920][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:56:50,246][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:56:50,572][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:56:50,899][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:56:51,225][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:56:51,552][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:56:51,882][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:56:52,211][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:56:52,544][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:56:52,872][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:56:53,200][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:56:53,526][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:56:53,855][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:56:54,188][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:56:54,517][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:56:54,843][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:56:55,168][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:56:55,494][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:56:55,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:56:56,543][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:56:57,278][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:56:57,279][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:56:57,281][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:56:58,268][__main__][INFO] - Iteration 460 took 23s (39.03% Gen, 56.72% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 32m 37s. Estimated total time: 19h 25m 46s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 51s, 500 more iterations: 3h 14m 17s.
[2025-11-13 10:56:58,270][__main__][INFO] - Starting iteration 460.
[2025-11-13 10:56:58,274][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1.
[2025-11-13 10:56:58,274][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:57:07,674][__main__][INFO] - Number of regex retries in iteration 460: 0
[2025-11-13 10:57:07,675][__main__][INFO] - agents played in iteration 460 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:57:08,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:08,170][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:08,204][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:08,238][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:08,239][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:57:08,240][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:57:09,007][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:57:09,305][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:57:09,633][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:57:09,962][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:57:10,290][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:57:10,618][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:57:10,946][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:57:11,274][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:57:11,600][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:57:11,926][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:57:12,253][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:57:12,580][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:57:12,908][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:57:13,237][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:57:13,564][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:57:13,890][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:57:14,218][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:57:14,550][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:57:14,877][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:57:15,205][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:57:15,532][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:57:15,859][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:57:16,187][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:57:16,520][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:57:16,847][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:57:17,176][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:57:17,504][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:57:17,832][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:57:18,160][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:57:18,489][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:57:18,815][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:57:19,142][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:57:19,472][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:57:20,155][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:57:20,901][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:57:20,903][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:57:20,905][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:57:22,863][__main__][INFO] - Iteration 461 took 24s (38.23% Gen, 53.80% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 35m 57s. Estimated total time: 20h 29m 31s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 59s, 500 more iterations: 3h 24m 55s.
[2025-11-13 10:57:22,865][__main__][INFO] - Starting iteration 461.
[2025-11-13 10:57:22,869][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1.
[2025-11-13 10:57:22,869][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:57:31,957][__main__][INFO] - Number of regex retries in iteration 461: 0
[2025-11-13 10:57:31,957][__main__][INFO] - agents played in iteration 461 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:57:32,416][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:32,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:32,484][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:32,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:32,519][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:57:32,520][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:57:33,283][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:57:33,583][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:57:33,912][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:57:34,239][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:57:34,566][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:57:34,895][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:57:35,222][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:57:35,549][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:57:35,875][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:57:36,203][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:57:36,529][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:57:36,856][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:57:37,182][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:57:37,508][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:57:37,838][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:57:38,166][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:57:38,492][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:57:38,818][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:57:39,146][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:57:39,472][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:57:39,800][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:57:40,127][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:57:40,453][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:57:40,780][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:57:41,108][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:57:41,433][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:57:41,764][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:57:42,090][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:57:42,418][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:57:42,745][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:57:43,071][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:57:43,397][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:57:43,724][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:57:44,429][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:57:45,153][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:57:45,155][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:57:45,156][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:57:46,253][__main__][INFO] - Iteration 462 took 23s (38.86% Gen, 56.44% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 35m 18s. Estimated total time: 19h 29m 15s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 58s, 500 more iterations: 3h 14m 52s.
[2025-11-13 10:57:46,255][__main__][INFO] - Starting iteration 462.
[2025-11-13 10:57:46,258][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1.
[2025-11-13 10:57:46,258][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:57:56,098][__main__][INFO] - Number of regex retries in iteration 462: 0 [2025-11-13 10:57:56,099][__main__][INFO] - agents played in iteration 462 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:57:56,556][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:57:56,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:57:56,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:57:56,658][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:57:56,659][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:57:56,659][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:57:57,396][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:57:57,694][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:57:58,021][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:57:58,349][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:57:58,678][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:57:59,004][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:57:59,331][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:57:59,658][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:57:59,984][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:58:00,310][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:58:00,637][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:58:00,969][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:58:01,297][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:58:01,626][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:58:01,954][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:58:02,283][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:58:02,617][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:58:02,949][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:58:03,278][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:58:03,612][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:58:03,939][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:58:04,266][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:58:04,596][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:58:04,923][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:58:05,255][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:58:05,582][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:58:05,908][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:58:06,237][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:58:06,567][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:58:06,893][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:58:07,220][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:58:07,553][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:58:07,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:58:08,595][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:58:09,324][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:58:09,325][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:58:09,327][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:58:10,311][__main__][INFO] - Iteration 463 took 24s (40.91% Gen, 55.00% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 8m 20s. Estimated total time: 20h 2m 41s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 5s, 500 more iterations: 3h 20m 26s. [2025-11-13 10:58:10,313][__main__][INFO] - Starting iteration 463. [2025-11-13 10:58:10,316][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. 
[2025-11-13 10:58:10,317][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:58:19,556][__main__][INFO] - Number of regex retries in iteration 463: 0 [2025-11-13 10:58:19,557][__main__][INFO] - agents played in iteration 463 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:58:20,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:20,054][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:20,088][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:20,122][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:20,123][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:58:20,123][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:58:20,869][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:58:21,168][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:58:21,497][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:58:21,824][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:58:22,150][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:58:22,477][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:58:22,804][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:58:23,130][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:58:23,459][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:58:23,784][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:58:24,115][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:58:24,444][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:58:24,772][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:58:25,099][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:58:25,430][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:58:25,761][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:58:26,089][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:58:26,421][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:58:26,748][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:58:27,074][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:58:27,403][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:58:27,735][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:58:28,067][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:58:28,394][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:58:28,722][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:58:29,049][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:58:29,375][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:58:29,702][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:58:30,029][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:58:30,355][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:58:30,680][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:58:31,008][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:58:31,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:58:32,061][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:58:32,786][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:58:32,788][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:58:32,790][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:58:33,788][__main__][INFO] - Iteration 464 took 23s (39.36% Gen, 56.38% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 38m 53s. Estimated total time: 19h 33m 38s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 7s, 500 more iterations: 3h 15m 36s. [2025-11-13 10:58:33,790][__main__][INFO] - Starting iteration 464. [2025-11-13 10:58:33,793][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. 
[2025-11-13 10:58:33,794][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:58:42,259][__main__][INFO] - Number of regex retries in iteration 464: 0 [2025-11-13 10:58:42,260][__main__][INFO] - agents played in iteration 464 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:58:42,716][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:42,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:42,783][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:42,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:42,818][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:58:42,819][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:58:43,572][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:58:43,869][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:58:44,196][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:58:44,524][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:58:44,851][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:58:45,177][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:58:45,503][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:58:45,830][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:58:46,156][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:58:46,483][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:58:46,809][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:58:47,136][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:58:47,469][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:58:47,800][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:58:48,128][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:58:48,454][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:58:48,787][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:58:49,116][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:58:49,445][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:58:49,774][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:58:50,103][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:58:50,432][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:58:50,758][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:58:51,085][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:58:51,414][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:58:51,743][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:58:52,070][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:58:52,403][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:58:52,729][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:58:53,056][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:58:53,381][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:58:53,711][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:58:54,043][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:58:54,758][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:58:55,490][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:58:55,492][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:58:55,494][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:58:56,570][__main__][INFO] - Iteration 465 took 22s (37.17% Gen, 58.10% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 3m 44s. Estimated total time: 18h 58m 52s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 57s, 500 more iterations: 3h 9m 48s. [2025-11-13 10:58:56,572][__main__][INFO] - Starting iteration 465. [2025-11-13 10:58:56,576][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. 
[2025-11-13 10:58:56,576][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:59:06,265][__main__][INFO] - Number of regex retries in iteration 465: 0 [2025-11-13 10:59:06,266][__main__][INFO] - agents played in iteration 465 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:59:06,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:59:06,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:59:06,794][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:59:06,828][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:59:06,828][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:59:06,829][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:59:07,576][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:59:07,873][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:59:08,200][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:59:08,528][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:59:08,855][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:59:09,181][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:59:09,507][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:59:09,835][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:59:10,161][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:59:10,490][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:59:10,816][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:59:11,141][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:59:11,470][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:59:11,797][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:59:12,124][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:59:12,455][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:59:12,783][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:59:13,109][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:59:13,437][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:59:13,765][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:59:14,094][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:59:14,422][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:59:14,750][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:59:15,080][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:59:15,410][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:59:15,738][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:59:16,066][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:59:16,393][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:59:16,722][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:59:17,054][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:59:17,382][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:59:17,713][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:59:18,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:59:18,773][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:59:19,492][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:59:19,494][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:59:19,495][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:59:20,491][__main__][INFO] - Iteration 466 took 23s (40.52% Gen, 55.31% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 0m 17s. Estimated total time: 19h 55m 48s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 51s, 500 more iterations: 3h 19m 18s. [2025-11-13 10:59:20,493][__main__][INFO] - Starting iteration 466. [2025-11-13 10:59:20,495][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. 
[2025-11-13 10:59:20,496][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:59:29,581][__main__][INFO] - Number of regex retries in iteration 466: 0 [2025-11-13 10:59:29,581][__main__][INFO] - agents played in iteration 466 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 10:59:30,043][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:59:30,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:59:30,110][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:59:30,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:59:30,145][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:59:30,145][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:59:30,881][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:59:31,178][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:59:31,505][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:59:31,831][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:59:32,159][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:59:32,486][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:59:32,814][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:59:33,140][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:59:33,466][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:59:33,796][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:59:34,127][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:59:34,458][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:59:34,785][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:59:35,112][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:59:35,438][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:59:35,766][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:59:36,095][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:59:36,426][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:59:36,755][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:59:37,087][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:59:37,416][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:59:37,749][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:59:38,077][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:59:38,407][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:59:38,734][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:59:39,064][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:59:39,390][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:59:39,716][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:59:40,041][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:59:40,368][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:59:40,695][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:59:41,027][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:59:41,358][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:59:42,075][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:59:42,801][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:59:42,803][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:59:42,804][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:59:43,705][__main__][INFO] - Iteration 467 took 23s (39.14% Gen, 56.97% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 24m 37s. Estimated total time: 19h 20m 31s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 41s, 500 more iterations: 3h 13m 25s. [2025-11-13 10:59:43,708][__main__][INFO] - Starting iteration 467. [2025-11-13 10:59:43,711][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. 
[2025-11-13 10:59:43,711][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:59:53,061][__main__][INFO] - Number of regex retries in iteration 467: 0
[2025-11-13 10:59:53,062][__main__][INFO] - agents played in iteration 467 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 10:59:53,520][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:53,553][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:53,587][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:53,621][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:53,621][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:59:53,623][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
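The "ΔVRAM % (total) / Current % / Block Peak %" lines above report memory as percentages of device capacity. A sketch of the arithmetic behind such a report, given raw byte counts; in the real trainer these would presumably come from the CUDA allocator (e.g. allocated-memory and peak-memory counters), but the function and argument names here are illustrative assumptions:

```python
def vram_report(before_bytes, after_bytes, peak_bytes, device_total_bytes):
    """Summarize a code block's memory use as percentages of device VRAM."""
    delta_pct = 100.0 * (after_bytes - before_bytes) / device_total_bytes
    current_pct = 100.0 * after_bytes / device_total_bytes
    peak_pct = 100.0 * peak_bytes / device_total_bytes
    return (f"ΔVRAM % (total): {delta_pct:.2f}%, "
            f"Current % of VRAM taken: {current_pct:.2f}%, "
            f"Block Peak % of device VRAM: {peak_pct:.2f}%")
```

A block that allocates and frees everything it touches shows ΔVRAM of 0.00% while still reporting a nonzero block peak, which is the pattern visible in the advantage-computation lines above.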
[2025-11-13 10:59:54,355][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:59:54,652][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:59:54,979][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:59:55,305][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:59:55,632][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:59:55,957][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:59:56,284][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:59:56,611][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:59:56,937][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:59:57,263][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:59:57,595][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:59:57,921][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:59:58,249][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:59:58,575][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:59:58,902][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:59:59,230][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:59:59,556][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:59:59,884][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:00:00,210][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:00:00,537][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:00:00,866][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:00:01,193][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:00:01,521][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:00:01,848][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:00:02,174][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:00:02,499][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:00:02,827][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:00:03,153][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:00:03,480][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:00:03,809][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:00:04,135][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:00:04,461][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:00:04,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:00:05,510][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:00:06,232][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:00:06,233][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:00:06,235][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:00:07,152][__main__][INFO] - Iteration 468 took 23s (39.89% Gen, 56.20% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 35m 48s. Estimated total time: 19h 32m 6s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 4s, 500 more iterations: 3h 15m 21s.
[2025-11-13 11:00:07,155][__main__][INFO] - Starting iteration 468.
[2025-11-13 11:00:07,158][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1.
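After every reinforce step the trainer writes the optimizer and annealing state back to the run directory (the `.pt` and `.pkl` paths above). Checkpoints saved at this frequency are vulnerable to corruption if the job is killed mid-write; a common defensive pattern, sketched here with `pickle` and an atomic rename (the function name is hypothetical, and the real trainer presumably uses `torch.save` for the `.pt` files):

```python
import os
import pickle
import tempfile

def save_state_atomic(state, path):
    """Write state to path via a temp file plus rename, so a crash
    mid-write leaves the previously saved checkpoint intact."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, path)  # atomic on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

The temp file lives in the same directory as the target so the `os.replace` stays on one filesystem, which is what makes the rename atomic.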
[2025-11-13 11:00:07,158][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:00:16,470][__main__][INFO] - Number of regex retries in iteration 468: 0
[2025-11-13 11:00:16,471][__main__][INFO] - agents played in iteration 468 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 11:00:16,926][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:16,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:16,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:17,027][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:17,027][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:00:17,028][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:00:17,782][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:00:18,078][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:00:18,408][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:00:18,735][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:00:19,061][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:00:19,390][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:00:19,716][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:00:20,043][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:00:20,371][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:00:20,698][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:00:21,026][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:00:21,356][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:00:21,687][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:00:22,014][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:00:22,340][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:00:22,669][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:00:22,996][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:00:23,328][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:00:23,659][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:00:23,992][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:00:24,325][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:00:24,654][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:00:24,984][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:00:25,311][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:00:25,639][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:00:25,968][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:00:26,296][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:00:26,622][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:00:26,952][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:00:27,279][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:00:27,606][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:00:27,934][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:00:28,265][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:00:28,992][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:00:29,716][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:00:29,717][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:00:29,719][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:00:30,615][__main__][INFO] - Iteration 469 took 23s (39.70% Gen, 56.48% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 36m 12s. Estimated total time: 19h 32m 53s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 5s, 500 more iterations: 3h 15m 28s.
[2025-11-13 11:00:30,617][__main__][INFO] - Starting iteration 469.
[2025-11-13 11:00:30,620][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1.
[2025-11-13 11:00:30,620][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:00:39,728][__main__][INFO] - Number of regex retries in iteration 469: 0
[2025-11-13 11:00:39,729][__main__][INFO] - agents played in iteration 469 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 11:00:40,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:40,215][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:40,248][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:40,281][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:40,281][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:00:40,282][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:00:41,019][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:00:41,318][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:00:41,648][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:00:41,975][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:00:42,301][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:00:42,629][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:00:42,957][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:00:43,282][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:00:43,610][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:00:43,936][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:00:44,262][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:00:44,589][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:00:44,915][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:00:45,241][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:00:45,567][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:00:45,899][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:00:46,229][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:00:46,558][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:00:46,887][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:00:47,213][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:00:47,541][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:00:47,873][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:00:48,203][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:00:48,531][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:00:48,858][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:00:49,184][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:00:49,511][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:00:49,836][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:00:50,163][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:00:50,492][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:00:50,819][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:00:51,147][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:00:51,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:00:52,180][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:00:52,898][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:00:52,899][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:00:52,901][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:00:53,867][__main__][INFO] - Iteration 470 took 23s (39.18% Gen, 56.66% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 25m 20s. Estimated total time: 19h 22m 24s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 44s, 500 more iterations: 3h 13m 44s.
[2025-11-13 11:00:53,870][__main__][INFO] - Starting iteration 470.
[2025-11-13 11:00:53,873][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1.
[2025-11-13 11:00:53,873][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:01:03,201][__main__][INFO] - Number of regex retries in iteration 470: 0
[2025-11-13 11:01:03,201][__main__][INFO] - agents played in iteration 470 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 11:01:03,669][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:03,702][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:03,736][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:03,770][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:03,771][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:01:03,772][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:01:04,535][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:01:04,832][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:01:05,162][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:01:05,491][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:01:05,817][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:01:06,144][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:01:06,470][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:01:06,797][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:01:07,124][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:01:07,450][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:01:07,779][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:01:08,105][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:01:08,432][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:01:08,761][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:01:09,091][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:01:09,423][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:01:09,756][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:01:10,088][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:01:10,418][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:01:10,747][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:01:11,073][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:01:11,400][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:01:11,729][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:01:12,057][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:01:12,385][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:01:12,712][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:01:13,040][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:01:13,367][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:01:13,697][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:01:14,029][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:01:14,355][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:01:14,680][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:01:15,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:01:15,735][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:01:16,455][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:01:16,457][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:01:16,458][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:01:18,341][__main__][INFO] - Iteration 471 took 24s (38.12% Gen, 54.18% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 25m 57s. Estimated total time: 20h 23m 26s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 46s, 500 more iterations: 3h 23m 54s.
[2025-11-13 11:01:18,343][__main__][INFO] - Starting iteration 471.
[2025-11-13 11:01:18,346][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:01:18,346][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:01:27,297][__main__][INFO] - Number of regex retries in iteration 471: 0
[2025-11-13 11:01:27,298][__main__][INFO] - agents played in iteration 471 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 11:01:27,754][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:28,133][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:28,166][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:28,200][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:28,201][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:01:28,202][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:01:28,937][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:01:29,236][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:01:29,563][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:01:29,889][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:01:30,215][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:01:30,541][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:01:30,868][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:01:31,196][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:01:31,524][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:01:31,850][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:01:32,176][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:01:32,503][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:01:32,829][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:01:33,155][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:01:33,484][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:01:33,814][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:01:34,140][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:01:34,467][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:01:34,797][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:01:35,123][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:01:35,450][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:01:35,778][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:01:36,105][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:01:36,431][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:01:36,760][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:01:37,088][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:01:37,417][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:01:37,742][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:01:38,071][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:01:38,401][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:01:38,728][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:01:39,056][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:01:39,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:01:40,108][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:01:40,802][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:01:40,804][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:01:40,806][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:01:41,681][__main__][INFO] - Iteration 472 took 23s (38.36% Gen, 57.88% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 28m 56s. Estimated total time: 19h 26m 48s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 53s, 500 more iterations: 3h 14m 28s.
[2025-11-13 11:01:41,683][__main__][INFO] - Starting iteration 472.
[2025-11-13 11:01:41,686][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:01:41,687][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:01:50,892][__main__][INFO] - Number of regex retries in iteration 472: 0
[2025-11-13 11:01:50,893][__main__][INFO] - agents played in iteration 472 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 11:01:51,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:51,382][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:51,415][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:51,449][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:51,450][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:01:51,450][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:01:52,200][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:01:52,495][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:01:52,823][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:01:53,152][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:01:53,478][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:01:53,805][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:01:54,132][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:01:54,459][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:01:54,788][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:01:55,113][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:01:55,440][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:01:55,769][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:01:56,096][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:01:56,422][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:01:56,748][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:01:57,074][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:01:57,402][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:01:57,729][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:01:58,055][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:01:58,384][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:01:58,714][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:01:59,041][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:01:59,369][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:01:59,696][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:02:00,024][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:02:00,350][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:02:00,679][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:02:01,005][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:02:01,332][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:02:01,666][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:02:01,997][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:02:02,328][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:02:02,660][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:02:03,385][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:02:04,079][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:02:04,081][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:02:04,082][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:02:05,024][__main__][INFO] - Iteration 473 took 23s (39.45% Gen, 56.51% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 28m 40s. Estimated total time: 19h 26m 55s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 53s, 500 more iterations: 3h 14m 29s.
[2025-11-13 11:02:05,026][__main__][INFO] - Starting iteration 473.
[2025-11-13 11:02:05,029][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:02:05,029][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:02:13,831][__main__][INFO] - Number of regex retries in iteration 473: 0
[2025-11-13 11:02:13,832][__main__][INFO] - agents played in iteration 473 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 11:02:14,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:14,321][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:14,355][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:14,388][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:14,389][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:02:14,389][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:02:15,132][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:02:15,431][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:02:15,759][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:02:16,085][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:02:16,413][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:02:16,739][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:02:17,066][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:02:17,392][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:02:17,719][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:02:18,047][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:02:18,377][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:02:18,703][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:02:19,030][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:02:19,357][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:02:19,686][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:02:20,017][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:02:20,343][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:02:20,669][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:02:20,995][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:02:21,321][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:02:21,648][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:02:21,974][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:02:22,303][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:02:22,630][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:02:22,959][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:02:23,286][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:02:23,614][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:02:23,940][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:02:24,265][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:02:24,593][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:02:24,923][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:02:25,251][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:02:25,584][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:02:26,310][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:02:27,018][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:02:27,020][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:02:27,022][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:02:27,936][__main__][INFO] - Iteration 474 took 22s (38.43% Gen, 57.58% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 6m 44s. Estimated total time: 19h 5m 23s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 10s, 500 more iterations: 3h 10m 53s.
[2025-11-13 11:02:27,938][__main__][INFO] - Starting iteration 474.
[2025-11-13 11:02:27,941][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:02:27,942][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:02:37,243][__main__][INFO] - Number of regex retries in iteration 474: 0
[2025-11-13 11:02:37,243][__main__][INFO] - agents played in iteration 474 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 11:02:37,752][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:37,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:37,819][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:37,853][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:37,853][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:02:37,853][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:02:38,590][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:02:38,888][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:02:39,215][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:02:39,541][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:02:39,867][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:02:40,193][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:02:40,521][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:02:40,847][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:02:41,174][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:02:41,502][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:02:41,829][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:02:42,156][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:02:42,488][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:02:42,814][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:02:43,141][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:02:43,468][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:02:43,797][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:02:44,125][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:02:44,451][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:02:44,779][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:02:45,106][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:02:45,432][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:02:45,760][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:02:46,087][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:02:46,414][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:02:46,743][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:02:47,070][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:02:47,396][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:02:47,721][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:02:48,051][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:02:48,379][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:02:48,706][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:02:49,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:02:49,748][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:02:50,484][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:02:50,485][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:02:50,487][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:02:51,380][__main__][INFO] - Iteration 475 took 23s (39.68% Gen, 56.50% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 32m 57s. Estimated total time: 19h 32m 0s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 4s, 500 more iterations: 3h 15m 20s.
[2025-11-13 11:02:51,382][__main__][INFO] - Starting iteration 475.
[2025-11-13 11:02:51,385][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:02:51,386][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:03:00,571][__main__][INFO] - Number of regex retries in iteration 475: 0
[2025-11-13 11:03:00,571][__main__][INFO] - agents played in iteration 475 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 11:03:01,035][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:01,069][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:01,102][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:01,137][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:01,137][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:03:01,138][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:03:01,878][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:03:02,177][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:03:02,503][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:03:02,831][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:03:03,158][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:03:03,485][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:03:03,812][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:03:04,137][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:03:04,463][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:03:04,791][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:03:05,118][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:03:05,443][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:03:05,771][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:03:06,099][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:03:06,427][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:03:06,755][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:03:07,080][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:03:07,406][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:03:07,738][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:03:08,066][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:03:08,393][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:03:08,719][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:03:09,046][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:03:09,373][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:03:09,699][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:03:10,025][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:03:10,352][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:03:10,678][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:03:11,006][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:03:11,335][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:03:11,663][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:03:11,990][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:03:12,317][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:03:13,031][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:03:13,757][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:03:13,759][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:03:13,760][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:03:14,655][__main__][INFO] - Iteration 476 took 23s (39.47% Gen, 56.68% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 24m 7s. Estimated total time: 19h 23m 33s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 47s, 500 more iterations: 3h 13m 55s.
[2025-11-13 11:03:14,657][__main__][INFO] - Starting iteration 476.
[2025-11-13 11:03:14,660][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:03:14,661][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:03:23,882][__main__][INFO] - Number of regex retries in iteration 476: 0
[2025-11-13 11:03:23,883][__main__][INFO] - agents played in iteration 476 are Bob_buffer, Alice, Bob, Alice_buffer
[2025-11-13 11:03:24,338][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:24,372][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:24,406][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:24,440][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:24,440][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:03:24,440][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:03:25,182][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:03:25,479][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:03:25,806][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:03:26,134][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:03:26,460][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:03:26,786][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:03:27,113][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:03:27,439][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:03:27,766][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:03:28,092][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:03:28,419][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:03:28,745][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:03:29,071][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:03:29,397][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:03:29,727][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:03:30,056][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:03:30,383][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:03:30,711][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:03:31,037][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:03:31,363][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:03:31,691][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:03:32,018][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:03:32,345][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:03:32,674][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:03:33,005][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:03:33,335][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:03:33,661][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:03:33,989][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:03:34,315][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:03:34,641][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:03:34,967][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:03:35,295][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:03:35,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:03:36,346][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:03:37,081][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:03:37,083][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:03:37,085][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:03:38,025][__main__][INFO] - Iteration 477 took 23s (39.47% Gen, 56.50% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 28m 29s. Estimated total time: 19h 28m 17s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 56s, 500 more iterations: 3h 14m 42s.
[2025-11-13 11:03:38,028][__main__][INFO] - Starting iteration 477.
[2025-11-13 11:03:38,030][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:03:38,031][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:06:51,636][mllm.models.large_language_model_local][INFO] - Loaded 47 past agent adapters from checkpoints directory.
[2025-11-13 11:07:10,505][mllm.models.large_language_model_local][INFO] - Initializing adapter 'agent_adapter': using existing weights from output folder '/scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/Qwen/Qwen2.5-7B-Instruct/adapters/agent_adapter'.
[2025-11-13 11:07:11,721][mllm.models.adapter_training_wrapper][INFO] - Adapter 'agent_adapter': loaded initial weights from '/scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/Qwen/Qwen2.5-7B-Instruct/adapters/agent_adapter'.
[2025-11-13 11:07:11,729][mllm.models.large_language_model_local][INFO] - Initializing adapter 'critic_adapter': using existing weights from output folder '/scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/Qwen/Qwen2.5-7B-Instruct/adapters/critic_adapter'.
[2025-11-13 11:07:13,187][mllm.models.adapter_training_wrapper][INFO] - Adapter 'critic_adapter': loaded initial weights from '/scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/Qwen/Qwen2.5-7B-Instruct/adapters/critic_adapter'.
[2025-11-13 11:09:23,128][mllm.training.trainer_common][INFO] - Loading trainer state from /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:09:23,130][mllm.training.trainer_common][INFO] - Loading policy optimizer state from /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:09:23,884][mllm.training.trainer_common][INFO] - Loading critic optimizer state from /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:09:23,886][__main__][INFO] - Starting iteration 477.
[2025-11-13 11:09:23,891][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:09:23,891][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:09:53,916][__main__][INFO] - Number of regex retries in iteration 477: 0
[2025-11-13 11:09:53,917][__main__][INFO] - agents played in iteration 477 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:09:54,363][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 39.04%, Block Peak % of device VRAM: 19.44%, ΔTime: 00:00:00
[2025-11-13 11:09:54,404][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 39.04%, Block Peak % of device VRAM: 19.44%, ΔTime: 00:00:00
[2025-11-13 11:09:54,444][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 39.04%, Block Peak % of device VRAM: 19.44%, ΔTime: 00:00:00
[2025-11-13 11:09:54,484][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 39.04%, Block Peak % of device VRAM: 19.44%, ΔTime: 00:00:00
[2025-11-13 11:09:54,484][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:09:54,485][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:09:55,095][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:09:55,715][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:09:56,038][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:09:56,360][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:09:56,685][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:09:57,010][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:09:57,337][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:09:57,659][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:09:57,983][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:09:58,308][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:09:58,631][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:09:58,954][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:09:59,276][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:09:59,598][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:09:59,920][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:10:00,243][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:10:00,566][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:10:00,888][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:10:01,211][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:10:01,533][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:10:01,855][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:10:02,177][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:10:02,499][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:10:02,820][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:10:03,142][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:10:03,466][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:10:03,787][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:10:04,110][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:10:04,434][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:10:04,759][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:10:05,080][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:10:05,403][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:10:05,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:10:06,428][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.78%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 11:10:07,391][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:10:07,394][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:10:07,395][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:10:08,417][__main__][INFO] - Iteration 478 took 44s (67.43% Gen, 30.27% Train). Generation: 30s, Training: 13s. Estimated remaining time: 37h 3m 1s. Estimated total time: 37h 6m 20s. Time estimates for 10 more iterations: 7m 25s, 100 more iterations: 1h 14m 12s, 500 more iterations: 6h 11m 3s. [2025-11-13 11:10:08,419][__main__][INFO] - Starting iteration 478. [2025-11-13 11:10:08,422][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. 
[2025-11-13 11:10:08,423][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:10:26,812][__main__][INFO] - Number of regex retries in iteration 478: 0 [2025-11-13 11:10:26,813][__main__][INFO] - agents played in iteration 478 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:10:27,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:10:27,273][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:10:27,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:10:27,352][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:10:27,353][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:10:27,354][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:10:28,033][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:10:28,330][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:10:28,655][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:10:28,980][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:10:29,302][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:10:29,624][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:10:29,948][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:10:30,270][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:10:30,595][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:10:30,918][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:10:31,241][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:10:31,564][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:10:31,887][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:10:32,213][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:10:32,543][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:10:32,866][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:10:33,189][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:10:33,513][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:10:33,837][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:10:34,164][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:10:34,484][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:10:34,806][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:10:35,130][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:10:35,455][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:10:35,779][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:10:36,103][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:10:36,427][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:10:36,750][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:10:37,076][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:10:37,400][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:10:37,729][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:10:38,056][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:10:38,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:10:39,087][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:10:39,825][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:10:39,827][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:10:39,828][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:10:40,812][__main__][INFO] - Iteration 479 took 32s (56.77% Gen, 40.18% Train). Generation: 18s, Training: 13s. Estimated remaining time: 26h 55m 41s. Estimated total time: 26h 59m 32s. Time estimates for 10 more iterations: 5m 23s, 100 more iterations: 53m 59s, 500 more iterations: 4h 29m 55s. [2025-11-13 11:10:40,814][__main__][INFO] - Starting iteration 479. [2025-11-13 11:10:40,817][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. 
[2025-11-13 11:10:40,818][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:10:53,205][__main__][INFO] - Number of regex retries in iteration 479: 0 [2025-11-13 11:10:53,206][__main__][INFO] - agents played in iteration 479 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:10:53,644][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:10:53,685][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:10:53,725][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:10:53,766][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:10:53,767][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:10:53,767][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:10:54,461][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:10:54,756][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:10:55,080][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:10:55,405][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:10:55,731][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:10:56,054][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:10:56,377][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:10:56,701][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:10:57,025][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:10:57,349][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:10:57,672][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:10:57,996][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:10:58,321][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:10:58,644][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:10:58,968][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:10:59,291][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:10:59,624][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:10:59,947][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:11:00,271][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:11:00,593][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:11:00,927][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:11:01,252][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:11:01,574][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:11:01,898][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:11:02,237][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:11:02,561][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:11:02,890][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:11:03,214][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:11:03,545][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:11:03,873][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:11:04,198][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:11:04,523][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:11:04,855][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:11:05,574][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:11:06,292][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:11:06,294][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:11:06,297][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:11:07,357][__main__][INFO] - Iteration 480 took 26s (46.67% Gen, 49.32% Train). Generation: 12s, Training: 13s. Estimated remaining time: 22h 2m 46s. Estimated total time: 22h 7m 3s. Time estimates for 10 more iterations: 4m 25s, 100 more iterations: 44m 14s, 500 more iterations: 3h 41m 10s. [2025-11-13 11:11:07,359][__main__][INFO] - Starting iteration 480. [2025-11-13 11:11:07,362][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. 
[2025-11-13 11:11:07,363][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:11:19,122][__main__][INFO] - Number of regex retries in iteration 480: 0 [2025-11-13 11:11:19,123][__main__][INFO] - agents played in iteration 480 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:11:19,567][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:11:19,608][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:11:19,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:11:19,688][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:11:19,689][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:11:19,689][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:11:20,381][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:11:20,677][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:11:21,001][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:11:21,324][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:11:21,647][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:11:21,970][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:11:22,294][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:11:22,618][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:11:22,944][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:11:23,268][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:11:23,592][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:11:23,916][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:11:24,239][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:11:24,566][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:11:24,889][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:11:25,213][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:11:25,536][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:11:25,859][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:11:26,184][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:11:26,508][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:11:26,833][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:11:27,157][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:11:27,481][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:11:27,804][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:11:28,130][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:11:28,460][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:11:28,778][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:11:29,104][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:11:29,429][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:11:29,760][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:11:30,079][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:11:30,406][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:11:30,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:11:31,430][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:11:32,127][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:11:32,129][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:11:32,131][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:11:33,980][__main__][INFO] - Iteration 481 took 26s (44.17% Gen, 48.87% Train). Generation: 11s, Training: 13s. Estimated remaining time: 22h 6m 11s. Estimated total time: 22h 10m 56s. Time estimates for 10 more iterations: 4m 26s, 100 more iterations: 44m 21s, 500 more iterations: 3h 41m 49s. [2025-11-13 11:11:33,982][__main__][INFO] - Starting iteration 481. [2025-11-13 11:11:33,985][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. 
[2025-11-13 11:11:33,985][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:11:45,084][__main__][INFO] - Number of regex retries in iteration 481: 0 [2025-11-13 11:11:45,085][__main__][INFO] - agents played in iteration 481 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:11:45,509][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:11:45,564][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:11:45,603][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:11:45,642][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:11:45,643][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:11:45,644][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:11:46,318][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:11:46,612][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:11:46,939][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:11:47,263][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:11:47,586][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:11:47,910][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:11:48,232][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:11:48,562][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:11:48,886][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:11:49,210][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:11:49,536][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:11:49,859][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:11:50,183][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:11:50,506][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:11:50,832][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:11:51,161][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:11:51,487][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:11:51,814][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:11:52,143][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:11:52,466][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:11:52,792][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:11:53,116][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:11:53,442][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:11:53,770][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:11:54,097][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:11:54,424][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:11:54,749][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:11:55,080][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:11:55,411][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:11:55,735][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:11:56,062][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:11:56,393][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:11:56,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:11:57,405][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:11:58,112][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:11:58,114][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:11:58,115][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:11:59,146][__main__][INFO] - Iteration 482 took 25s (44.11% Gen, 51.78% Train). Generation: 11s, Training: 13s. Estimated remaining time: 20h 52m 58s. Estimated total time: 20h 58m 7s. Time estimates for 10 more iterations: 4m 11s, 100 more iterations: 41m 56s, 500 more iterations: 3h 29m 41s. [2025-11-13 11:11:59,148][__main__][INFO] - Starting iteration 482. [2025-11-13 11:11:59,151][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. 
[2025-11-13 11:11:59,151][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:12:09,654][__main__][INFO] - Number of regex retries in iteration 482: 0 [2025-11-13 11:12:09,655][__main__][INFO] - agents played in iteration 482 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:12:10,083][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:10,123][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:10,161][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:10,200][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:10,201][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:12:10,201][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:12:10,889][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:12:11,186][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:12:11,511][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:12:11,835][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:12:12,159][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:12:12,483][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:12:12,807][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:12:13,130][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:12:13,454][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:12:13,778][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:12:14,103][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:12:14,426][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:12:14,750][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:12:15,074][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:12:15,398][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:12:15,720][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:12:16,045][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:12:16,369][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:12:16,692][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:12:17,016][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:12:17,342][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:12:17,668][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:12:17,992][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:12:18,316][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:12:18,641][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:12:18,965][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:12:19,293][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:12:19,621][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:12:19,952][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:12:20,279][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:12:20,611][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:12:20,938][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:12:21,263][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:12:21,965][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:12:22,671][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:12:22,672][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:12:22,674][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:12:23,472][__main__][INFO] - Iteration 483 took 24s (43.18% Gen, 53.53% Train). Generation: 10s, Training: 13s. Estimated remaining time: 20h 10m 32s. Estimated total time: 20h 16m 6s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 32s, 500 more iterations: 3h 22m 41s. [2025-11-13 11:12:23,474][__main__][INFO] - Starting iteration 483. [2025-11-13 11:12:23,477][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. 
[2025-11-13 11:12:23,477][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:12:33,718][__main__][INFO] - Number of regex retries in iteration 483: 0 [2025-11-13 11:12:33,719][__main__][INFO] - agents played in iteration 483 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:12:34,137][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:34,177][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:34,218][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:34,258][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:34,259][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:12:34,259][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:12:34,962][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:12:35,258][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:12:35,584][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:12:35,908][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:12:36,234][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:12:36,559][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:12:36,888][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:12:37,222][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:12:37,544][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:12:37,869][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:12:38,193][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:12:38,518][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:12:38,842][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:12:39,165][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:12:39,488][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:12:39,813][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:12:40,136][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:12:40,459][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:12:40,783][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:12:41,106][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:12:41,429][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:12:41,753][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:12:42,078][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:12:42,402][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:12:42,727][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:12:43,051][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:12:43,376][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:12:43,704][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:12:44,028][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:12:44,354][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:12:44,679][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:12:45,003][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:12:45,327][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:12:46,026][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:12:46,724][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:12:46,725][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:12:46,726][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:12:47,707][__main__][INFO] - Iteration 484 took 24s (42.27% Gen, 53.68% Train). Generation: 10s, Training: 13s. Estimated remaining time: 20h 5m 34s. Estimated total time: 20h 11m 32s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 23s, 500 more iterations: 3h 21m 55s. [2025-11-13 11:12:47,709][__main__][INFO] - Starting iteration 484. [2025-11-13 11:12:47,711][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. 
[2025-11-13 11:12:47,712][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:12:57,918][__main__][INFO] - Number of regex retries in iteration 484: 0 [2025-11-13 11:12:57,919][__main__][INFO] - agents played in iteration 484 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:12:58,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:58,388][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:58,427][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:58,467][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:58,467][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:12:58,468][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:12:59,161][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:12:59,459][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:12:59,783][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:13:00,107][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:13:00,431][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:13:00,754][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:13:01,078][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:13:01,402][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:13:01,726][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:13:02,050][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:13:02,375][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:13:02,699][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:13:03,022][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:13:03,345][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:13:03,669][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:13:03,992][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:13:04,316][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:13:04,639][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:13:04,964][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:13:05,287][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:13:05,615][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:13:05,940][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:13:06,265][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:13:06,596][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:13:06,925][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:13:07,253][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:13:07,577][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:13:07,909][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:13:08,234][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:13:08,560][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:13:08,885][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:13:09,210][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:13:09,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:13:10,216][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:13:10,967][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:13:10,969][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:13:10,971][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:13:11,777][__main__][INFO] - Iteration 485 took 24s (42.41% Gen, 54.23% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 56m 58s. Estimated total time: 20h 3m 20s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 6s, 500 more iterations: 3h 20m 33s. [2025-11-13 11:13:11,779][__main__][INFO] - Starting iteration 485. [2025-11-13 11:13:11,782][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. 
[2025-11-13 11:13:11,783][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:13:21,657][__main__][INFO] - Number of regex retries in iteration 485: 0 [2025-11-13 11:13:21,658][__main__][INFO] - agents played in iteration 485 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:13:22,088][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:13:22,133][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:13:22,173][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:13:22,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:13:22,214][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:13:22,214][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:13:22,892][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:13:23,189][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:13:23,515][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:13:23,839][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:13:24,162][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:13:24,486][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:13:24,810][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:13:25,133][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:13:25,458][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:13:25,784][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:13:26,108][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:13:26,438][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:13:26,766][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:13:27,092][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:13:27,416][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:13:27,739][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:13:28,063][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:13:28,386][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:13:28,711][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:13:29,036][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:13:29,360][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:13:29,685][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:13:30,012][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:13:30,340][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:13:30,665][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:13:30,996][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:13:31,325][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:13:31,651][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:13:31,976][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:13:32,302][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:13:32,626][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:13:32,950][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:13:33,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:13:33,975][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:13:34,683][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:13:34,685][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:13:34,686][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:13:35,679][__main__][INFO] - Iteration 486 took 23s (41.32% Gen, 54.52% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 48m 7s. Estimated total time: 19h 54m 53s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 49s, 500 more iterations: 3h 19m 8s. [2025-11-13 11:13:35,681][__main__][INFO] - Starting iteration 486. [2025-11-13 11:13:35,685][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. 
[2025-11-13 11:13:35,685][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:13:45,436][__main__][INFO] - Number of regex retries in iteration 486: 0 [2025-11-13 11:13:45,437][__main__][INFO] - agents played in iteration 486 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:13:45,857][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:13:45,897][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:13:45,938][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:13:45,979][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:13:45,980][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:13:45,980][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:13:46,662][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:13:46,959][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:13:47,285][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:13:47,609][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:13:47,941][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:13:48,264][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:13:48,587][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:13:48,910][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:13:49,243][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:13:49,568][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:13:49,893][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:13:50,220][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:13:50,546][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:13:50,870][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:13:51,194][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:13:51,518][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:13:51,843][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:13:52,170][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:13:52,495][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:13:52,821][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:13:53,147][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:13:53,472][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:13:53,797][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:13:54,124][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:13:54,453][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:13:54,781][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:13:55,106][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:13:55,431][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:13:55,760][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:13:56,086][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:13:56,411][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:13:56,737][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:13:57,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:13:57,750][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:13:58,448][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:13:58,450][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:13:58,452][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:13:59,385][__main__][INFO] - Iteration 487 took 23s (41.14% Gen, 54.91% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 37m 55s. Estimated total time: 19h 45m 4s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 30s, 500 more iterations: 3h 17m 30s. [2025-11-13 11:13:59,387][__main__][INFO] - Starting iteration 487. [2025-11-13 11:13:59,391][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. 
[2025-11-13 11:13:59,391][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:14:08,590][__main__][INFO] - Number of regex retries in iteration 487: 0 [2025-11-13 11:14:08,591][__main__][INFO] - agents played in iteration 487 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:14:09,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:14:09,065][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:14:09,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:14:09,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:14:09,144][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:14:09,145][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:14:09,853][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:14:10,151][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:14:10,476][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:14:10,799][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:14:11,130][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:14:11,461][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:14:11,789][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:14:12,120][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:14:12,445][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:14:12,769][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:14:13,094][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:14:13,418][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:14:13,741][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:14:14,067][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:14:14,391][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:14:14,714][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:14:15,039][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:14:15,365][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:14:15,690][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:14:16,017][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:14:16,343][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:14:16,676][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:14:17,003][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:14:17,331][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:14:17,656][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:14:17,982][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:14:18,308][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:14:18,635][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:14:18,960][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:14:19,292][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:14:19,616][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:14:19,940][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:14:20,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:14:20,977][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:14:21,666][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:14:21,667][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:14:21,668][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:14:22,579][__main__][INFO] - Iteration 488 took 23s (39.67% Gen, 56.39% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 11m 55s. Estimated total time: 19h 19m 28s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 38s, 500 more iterations: 3h 13m 14s. [2025-11-13 11:14:22,581][__main__][INFO] - Starting iteration 488. [2025-11-13 11:14:22,584][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. 
[2025-11-13 11:14:22,584][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:14:32,607][__main__][INFO] - Number of regex retries in iteration 488: 0 [2025-11-13 11:14:32,608][__main__][INFO] - agents played in iteration 488 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:14:33,039][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:14:33,080][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:14:33,120][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:14:33,161][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:14:33,161][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:14:33,162][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:14:33,851][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:14:34,149][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:14:34,476][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:14:34,801][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:14:35,125][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:14:35,452][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:14:35,778][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:14:36,105][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:14:36,430][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:14:36,758][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:14:37,082][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:14:37,407][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:14:37,731][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:14:38,069][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:14:38,393][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:14:38,719][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:14:39,043][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:14:39,372][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:14:39,699][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:14:40,024][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:14:40,349][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:14:40,678][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:14:41,005][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:14:41,332][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:14:41,657][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:14:41,982][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:14:42,308][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:14:42,634][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:14:42,964][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:14:43,285][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:14:43,610][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:14:43,935][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:14:44,263][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:14:44,959][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:14:45,660][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:14:45,662][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:14:45,663][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:14:46,509][__main__][INFO] - Iteration 489 took 23s (41.89% Gen, 54.57% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 48m 22s. Estimated total time: 19h 56m 19s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 52s, 500 more iterations: 3h 19m 23s. [2025-11-13 11:14:46,512][__main__][INFO] - Starting iteration 489. [2025-11-13 11:14:46,515][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. 
[2025-11-13 11:14:46,515][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:14:55,461][__main__][INFO] - Number of regex retries in iteration 489: 0 [2025-11-13 11:14:55,462][__main__][INFO] - agents played in iteration 489 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:14:55,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:14:55,934][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:14:55,974][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:14:56,015][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:14:56,015][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:14:56,015][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:14:56,693][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:14:56,989][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:14:57,315][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:14:57,639][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:14:57,962][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:14:58,287][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:14:58,614][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:14:58,937][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:14:59,263][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:14:59,589][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:14:59,913][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:15:00,238][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:15:00,562][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:15:00,890][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:15:01,218][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:15:01,542][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:15:01,878][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:15:02,210][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:15:02,535][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:15:02,859][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:15:03,184][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:15:03,509][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:15:03,835][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:15:04,161][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:15:04,489][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:15:04,815][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:15:05,142][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:15:05,475][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:15:05,801][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:15:06,126][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:15:06,452][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:15:06,776][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:15:07,101][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:15:07,793][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:15:08,520][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:15:08,522][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:15:08,524][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:15:09,491][__main__][INFO] - Iteration 490 took 22s (38.93% Gen, 56.85% Train). Generation: 8s, Training: 13s. Estimated remaining time: 19h 0m 32s. Estimated total time: 19h 8m 52s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 17s, 500 more iterations: 3h 11m 28s. [2025-11-13 11:15:09,494][__main__][INFO] - Starting iteration 490. [2025-11-13 11:15:09,497][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. 
[2025-11-13 11:15:09,497][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:15:19,002][__main__][INFO] - Number of regex retries in iteration 490: 0 [2025-11-13 11:15:19,003][__main__][INFO] - agents played in iteration 490 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:15:19,429][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:15:19,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:15:19,511][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:15:19,551][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:15:19,552][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:15:19,552][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:15:20,252][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:15:20,549][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:15:20,875][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:15:21,199][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:15:21,523][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:15:21,848][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:15:22,173][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:15:22,497][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:15:22,825][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:15:23,150][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:15:23,481][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:15:23,806][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:15:24,129][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:15:24,454][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:15:24,781][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:15:25,110][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:15:25,434][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:15:25,769][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:15:26,095][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:15:26,423][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:15:26,751][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:15:27,080][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:15:27,406][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:15:27,734][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:15:28,063][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:15:28,392][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:15:28,717][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:15:29,042][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:15:29,368][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:15:29,697][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:15:30,022][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:15:30,347][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:15:30,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:15:31,378][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:15:32,071][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:15:32,073][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:15:32,078][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:15:33,779][__main__][INFO] - Iteration 491 took 24s (39.14% Gen, 53.85% Train). Generation: 9s, Training: 13s. Estimated remaining time: 20h 5m 26s. Estimated total time: 20h 14m 10s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 28s, 500 more iterations: 3h 22m 21s. [2025-11-13 11:15:33,782][__main__][INFO] - Starting iteration 491. [2025-11-13 11:15:33,785][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. 
[2025-11-13 11:15:33,786][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:15:43,049][__main__][INFO] - Number of regex retries in iteration 491: 0 [2025-11-13 11:15:43,050][__main__][INFO] - agents played in iteration 491 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:15:43,478][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:15:43,518][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:15:43,557][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:15:43,596][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:15:43,597][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:15:43,597][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:15:44,290][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:15:44,586][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:15:44,912][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:15:45,235][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:15:45,572][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:15:45,896][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:15:46,221][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:15:46,547][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:15:46,872][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:15:47,196][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:15:47,526][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:15:47,853][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:15:48,191][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:15:48,516][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:15:48,840][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:15:49,167][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:15:49,495][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:15:49,821][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:15:50,145][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:15:50,473][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:15:50,799][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:15:51,125][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:15:51,451][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:15:51,776][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:15:52,106][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:15:52,433][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:15:52,759][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:15:53,084][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:15:53,410][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:15:53,736][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:15:54,064][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:15:54,389][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:15:54,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:15:55,406][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:15:56,149][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:15:56,150][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:15:56,152][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:15:57,177][__main__][INFO] - Iteration 492 took 23s (39.60% Gen, 56.01% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 20m 30s. Estimated total time: 19h 29m 38s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 59s, 500 more iterations: 3h 14m 56s. [2025-11-13 11:15:57,179][__main__][INFO] - Starting iteration 492. [2025-11-13 11:15:57,182][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. 
[2025-11-13 11:15:57,182][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:16:06,901][__main__][INFO] - Number of regex retries in iteration 492: 0 [2025-11-13 11:16:06,902][__main__][INFO] - agents played in iteration 492 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:16:07,338][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:07,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:07,419][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:07,459][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:07,460][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:16:07,460][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:16:08,159][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:16:08,457][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:16:08,782][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:16:09,111][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:16:09,430][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:16:09,754][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:16:10,079][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:16:10,409][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:16:10,727][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:16:11,052][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:16:11,378][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:16:11,711][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:16:12,032][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:16:12,357][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:16:12,682][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:16:13,011][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:16:13,332][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:16:13,660][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:16:13,985][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:16:14,315][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:16:14,637][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:16:14,962][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:16:15,288][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:16:15,619][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:16:15,944][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:16:16,272][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:16:16,598][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:16:16,924][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:16:17,248][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:16:17,573][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:16:17,898][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:16:18,224][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:16:18,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:16:19,244][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:16:19,958][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:16:19,960][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:16:19,961][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:16:21,103][__main__][INFO] - Iteration 493 took 23s (40.63% Gen, 54.59% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 46m 36s. Estimated total time: 19h 56m 7s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 52s, 500 more iterations: 3h 19m 21s. [2025-11-13 11:16:21,105][__main__][INFO] - Starting iteration 493. [2025-11-13 11:16:21,108][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. 
[2025-11-13 11:16:21,108][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:16:30,133][__main__][INFO] - Number of regex retries in iteration 493: 0 [2025-11-13 11:16:30,133][__main__][INFO] - agents played in iteration 493 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:16:30,557][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:30,599][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:30,638][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:30,678][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:30,678][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:16:30,679][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:16:31,371][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:16:31,668][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:16:31,995][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:16:32,321][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:16:32,651][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:16:32,979][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:16:33,307][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:16:33,631][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:16:33,956][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:16:34,281][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:16:34,609][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:16:34,934][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:16:35,262][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:16:35,588][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:16:35,914][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:16:36,245][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:16:36,574][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:16:36,900][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:16:37,225][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:16:37,550][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:16:37,877][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128
[2025-11-13 11:16:38,203][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:16:38,528][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:16:38,853][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:16:39,179][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:16:39,511][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:16:39,832][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:16:40,158][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:16:40,484][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:16:40,809][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:16:41,134][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:16:41,460][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:16:41,785][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:16:42,494][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:16:43,179][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:16:43,180][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:16:43,182][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:16:44,123][__main__][INFO] - Iteration 494 took 23s (39.21% Gen, 56.69% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 0m 52s. Estimated total time: 19h 10m 47s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 21s, 500 more iterations: 3h 11m 47s.
[2025-11-13 11:16:44,125][__main__][INFO] - Starting iteration 494.
[2025-11-13 11:16:44,128][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:16:44,128][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:16:53,325][__main__][INFO] - Number of regex retries in iteration 494: 0
[2025-11-13 11:16:53,326][__main__][INFO] - agents played in iteration 494 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:16:53,744][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:16:53,784][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:16:53,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:16:53,863][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:16:53,863][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:16:53,864][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:16:54,540][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:16:54,837][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:16:55,164][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:16:55,491][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:16:55,820][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:16:56,145][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:16:56,471][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:16:56,803][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:16:57,131][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:16:57,460][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:16:57,793][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:16:58,117][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:16:58,445][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:16:58,776][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:16:59,102][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:16:59,427][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:16:59,752][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:17:00,077][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:17:00,403][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:17:00,728][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:17:01,054][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:17:01,380][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:17:01,711][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:17:02,037][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:17:02,362][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:17:02,686][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:17:03,012][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:17:03,337][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:17:03,663][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:17:03,989][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:17:04,314][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:17:04,639][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:17:04,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:17:05,658][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:17:06,374][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:17:06,375][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:17:06,377][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:17:07,438][__main__][INFO] - Iteration 495 took 23s (39.45% Gen, 55.99% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 15m 14s. Estimated total time: 19h 25m 31s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 51s, 500 more iterations: 3h 14m 15s.
[2025-11-13 11:17:07,440][__main__][INFO] - Starting iteration 495.
[2025-11-13 11:17:07,443][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:17:07,443][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:17:16,819][__main__][INFO] - Number of regex retries in iteration 495: 0
[2025-11-13 11:17:16,820][__main__][INFO] - agents played in iteration 495 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:17:17,248][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:17,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:17,328][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:17,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:17,368][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:17:17,368][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:17:18,046][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:17:18,342][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:17:18,673][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:17:19,000][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:17:19,329][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:17:19,653][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:17:19,979][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:17:20,310][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:17:20,637][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:17:20,968][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:17:21,300][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:17:21,628][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:17:21,953][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:17:22,281][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:17:22,610][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:17:22,937][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:17:23,262][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:17:23,588][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:17:23,913][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:17:24,239][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:17:24,563][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:17:24,889][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:17:25,215][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:17:25,541][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:17:25,866][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:17:26,193][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:17:26,527][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:17:26,852][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:17:27,177][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:17:27,506][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:17:27,832][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:17:28,158][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:17:28,483][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:17:29,168][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:17:29,857][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:17:29,859][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:17:29,861][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:17:30,674][__main__][INFO] - Iteration 496 took 23s (40.36% Gen, 56.14% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 10m 55s. Estimated total time: 19h 21m 36s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 43s, 500 more iterations: 3h 13m 36s.
[2025-11-13 11:17:30,676][__main__][INFO] - Starting iteration 496.
[2025-11-13 11:17:30,680][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:17:30,680][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:17:40,359][__main__][INFO] - Number of regex retries in iteration 496: 0
[2025-11-13 11:17:40,360][__main__][INFO] - agents played in iteration 496 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:17:40,798][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:40,840][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:40,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:40,919][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:40,919][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:17:40,920][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:17:41,628][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:17:41,928][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:17:42,255][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:17:42,582][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:17:42,911][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:17:43,239][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:17:43,568][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:17:43,893][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:17:44,219][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:17:44,544][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:17:44,870][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:17:45,202][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:17:45,527][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:17:45,853][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:17:46,178][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:17:46,503][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:17:46,830][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:17:47,154][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:17:47,479][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:17:47,806][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:17:48,131][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:17:48,455][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:17:48,781][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:17:49,105][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:17:49,430][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:17:49,758][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:17:50,083][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:17:50,409][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:17:50,734][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:17:51,059][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:17:51,384][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:17:51,709][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:17:52,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:17:52,744][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:17:53,426][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:17:53,428][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:17:53,429][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:17:54,388][__main__][INFO] - Iteration 497 took 23s (40.83% Gen, 55.12% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 34m 23s. Estimated total time: 19h 45m 28s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 30s, 500 more iterations: 3h 17m 34s.
[2025-11-13 11:17:54,390][__main__][INFO] - Starting iteration 497.
[2025-11-13 11:17:54,392][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:17:54,393][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:18:04,418][__main__][INFO] - Number of regex retries in iteration 497: 0
[2025-11-13 11:18:04,419][__main__][INFO] - agents played in iteration 497 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:18:04,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:04,900][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:04,941][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:04,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:04,982][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:18:04,982][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:18:05,702][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:18:06,000][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:18:06,326][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:18:06,652][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:18:06,978][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:18:07,313][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:18:07,640][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:18:07,965][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:18:08,293][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:18:08,618][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:18:08,944][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:18:09,269][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:18:09,596][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:18:09,921][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:18:10,247][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:18:10,573][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:18:10,906][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:18:11,231][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:18:11,555][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:18:11,882][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:18:12,206][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:18:12,533][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:18:12,860][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:18:13,184][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:18:13,511][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:18:13,839][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:18:14,167][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:18:14,495][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:18:14,823][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:18:15,149][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:18:15,481][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:18:15,805][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:18:16,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:18:16,822][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:18:17,516][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:18:17,517][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:18:17,519][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:18:18,337][__main__][INFO] - Iteration 498 took 23s (41.87% Gen, 54.71% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 45m 49s. Estimated total time: 19h 57m 18s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 54s, 500 more iterations: 3h 19m 33s.
[2025-11-13 11:18:18,340][__main__][INFO] - Starting iteration 498.
[2025-11-13 11:18:18,343][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:18:18,344][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:18:28,361][__main__][INFO] - Number of regex retries in iteration 498: 0
[2025-11-13 11:18:28,362][__main__][INFO] - agents played in iteration 498 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:18:28,791][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:28,831][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:28,876][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:28,920][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:28,922][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:18:28,923][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:18:29,664][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:18:29,962][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:18:30,287][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:18:30,616][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:18:30,944][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:18:31,269][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:18:31,596][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:18:31,921][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:18:32,247][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:18:32,572][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:18:32,899][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:18:33,224][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:18:33,553][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:18:33,883][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:18:34,212][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:18:34,536][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:18:34,868][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:18:35,195][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:18:35,518][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:18:35,845][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:18:36,176][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:18:36,504][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:18:36,831][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:18:37,157][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:18:37,484][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:18:37,811][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:18:38,136][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:18:38,462][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:18:38,787][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:18:39,111][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:18:39,437][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:18:39,762][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:18:40,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:18:40,787][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:18:41,471][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:18:41,472][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:18:41,475][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:18:42,313][__main__][INFO] - Iteration 499 took 23s (41.79% Gen, 54.70% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 46m 39s. Estimated total time: 19h 58m 32s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 57s, 500 more iterations: 3h 19m 45s.
[2025-11-13 11:18:42,315][__main__][INFO] - Starting iteration 499.
[2025-11-13 11:18:42,318][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:18:42,319][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:18:51,661][__main__][INFO] - Number of regex retries in iteration 499: 0
[2025-11-13 11:18:51,662][__main__][INFO] - agents played in iteration 499 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:18:52,095][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:52,135][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:52,176][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:52,217][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:52,218][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:18:52,218][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:18:52,941][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:18:53,238][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:18:53,565][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:18:53,889][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:18:54,227][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:18:54,553][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:18:54,878][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:18:55,203][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:18:55,537][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:18:55,863][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:18:56,189][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:18:56,522][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:18:56,849][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:18:57,174][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:18:57,500][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:18:57,830][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:18:58,152][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:18:58,476][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:18:58,804][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:18:59,130][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:18:59,459][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:18:59,785][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:19:00,113][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:19:00,446][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:19:00,771][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:19:01,096][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:19:01,423][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:19:01,748][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:19:02,074][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:19:02,400][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:19:02,725][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:19:03,054][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:19:03,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:19:04,070][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:19:04,755][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:19:04,757][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:19:04,759][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:19:05,571][__main__][INFO] - Iteration 500 took 23s (40.18% Gen, 56.32% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 10m 26s. Estimated total time: 19h 22m 42s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 45s, 500 more iterations: 3h 13m 47s.
[2025-11-13 11:19:05,573][__main__][INFO] - Starting iteration 500.
[2025-11-13 11:19:05,576][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
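The per-iteration summary records above follow a fixed template, so the timing fields can be pulled out with a small regex for plotting or monitoring. A throwaway parser; the field names are mine, inferred from the record text, not an `mllm` API:

```python
import re

# Matches e.g. "Iteration 500 took 23s (40.18% Gen, 56.32% Train)."
SUMMARY = re.compile(
    r"Iteration (?P<it>\d+) took (?P<secs>\d+)s "
    r"\((?P<gen>[\d.]+)% Gen, (?P<train>[\d.]+)% Train\)"
)

def parse_summary(record: str) -> dict:
    """Extract iteration number, wall time, and gen/train shares from a summary record."""
    m = SUMMARY.search(record)
    if m is None:
        raise ValueError("not an iteration summary record")
    return {
        "iteration": int(m.group("it")),
        "seconds": int(m.group("secs")),
        "gen_pct": float(m.group("gen")),
        "train_pct": float(m.group("train")),
    }

info = parse_summary("Iteration 500 took 23s (40.18% Gen, 56.32% Train). Generation: 9s, Training: 13s.")
```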
[2025-11-13 11:19:05,576][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:19:14,830][__main__][INFO] - Number of regex retries in iteration 500: 0
[2025-11-13 11:19:14,831][__main__][INFO] - agents played in iteration 500 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:19:15,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:15,301][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:15,341][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:15,382][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:15,383][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:19:15,383][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:19:16,113][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:19:16,420][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:19:16,746][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:19:17,071][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:19:17,397][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:19:17,734][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:19:18,059][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:19:18,384][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:19:18,714][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:19:19,046][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:19:19,371][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:19:19,696][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:19:20,021][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:19:20,344][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:19:20,668][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:19:20,991][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:19:21,317][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:19:21,642][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:19:21,967][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:19:22,291][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:19:22,617][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:19:22,948][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:19:23,284][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:19:23,615][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:19:23,954][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:19:24,282][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:19:24,612][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:19:24,937][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:19:25,262][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:19:25,588][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:19:25,914][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:19:26,242][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:19:26,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:19:27,253][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:19:27,950][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:19:27,952][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:19:27,954][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:19:29,629][__main__][INFO] - Iteration 501 took 24s (38.47% Gen, 54.56% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 50m 4s. Estimated total time: 20h 2m 43s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 5s, 500 more iterations: 3h 20m 27s.
[2025-11-13 11:19:29,632][__main__][INFO] - Starting iteration 501.
[2025-11-13 11:19:29,635][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
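The "Estimated remaining time" figures above are consistent with simply scaling a per-iteration time by the number of iterations left. A minimal sketch of that arithmetic; the total iteration count and the 24 s average used below are illustrative assumptions, not values taken from the run's config:

```python
def eta_seconds(avg_iter_secs: float, done: int, total: int) -> float:
    """Remaining wall-clock time if every future iteration matches the average."""
    return avg_iter_secs * (total - done)

def fmt_hms(secs: float) -> str:
    """Render seconds in the log's 'Xh Ym Zs' style."""
    s = int(secs)
    return f"{s // 3600}h {(s % 3600) // 60}m {s % 60}s"

# Illustrative only: 24 s/iteration with 2975 iterations remaining
# lands on the same ~"19h 50m" scale the log reports.
remaining = eta_seconds(24, 501, 3476)
```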
[2025-11-13 11:19:29,635][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:19:38,345][__main__][INFO] - Number of regex retries in iteration 501: 0
[2025-11-13 11:19:38,346][__main__][INFO] - agents played in iteration 501 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:19:38,790][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:38,831][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:38,871][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:38,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:38,912][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:19:38,912][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:19:39,623][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:19:39,925][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:19:40,252][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:19:40,578][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:19:40,906][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:19:41,235][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:19:41,567][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:19:41,891][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:19:42,219][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:19:42,550][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:19:42,878][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:19:43,211][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:19:43,539][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:19:43,866][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:19:44,194][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:19:44,526][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:19:44,857][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:19:45,190][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:19:45,518][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:19:45,846][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:19:46,174][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:19:46,501][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:19:46,829][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:19:47,157][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:19:47,490][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:19:47,822][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:19:48,151][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:19:48,476][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:19:48,800][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:19:49,132][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:19:49,459][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:19:49,782][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:19:50,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:19:50,810][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:19:51,495][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:19:51,497][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:19:51,499][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:19:52,305][__main__][INFO] - Iteration 502 took 22s (38.42% Gen, 58.02% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 40m 30s. Estimated total time: 18h 53m 32s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 47s, 500 more iterations: 3h 8m 55s.
[2025-11-13 11:19:52,308][__main__][INFO] - Starting iteration 502.
[2025-11-13 11:19:52,312][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:19:52,313][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:20:01,448][__main__][INFO] - Number of regex retries in iteration 502: 0
[2025-11-13 11:20:01,449][__main__][INFO] - agents played in iteration 502 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:20:01,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:01,919][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:01,960][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:02,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:02,001][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:20:02,001][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:20:02,713][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:20:03,010][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:20:03,336][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:20:03,662][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:20:03,986][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:20:04,312][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:20:04,637][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:20:04,964][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:20:05,289][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:20:05,617][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:20:05,942][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:20:06,266][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:20:06,591][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:20:06,921][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:20:07,255][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:20:07,586][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:20:07,914][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:20:08,239][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:20:08,564][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:20:08,889][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:20:09,217][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:20:09,547][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:20:09,873][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:20:10,200][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:20:10,534][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:20:10,858][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:20:11,183][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:20:11,508][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:20:11,834][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:20:12,159][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:20:12,484][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:20:12,810][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:20:13,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:20:13,838][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:20:14,564][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:20:14,565][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:20:14,566][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:20:15,701][__main__][INFO] - Iteration 503 took 23s (39.06% Gen, 56.08% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 16m 5s. Estimated total time: 19h 29m 30s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 59s, 500 more iterations: 3h 14m 55s.
[2025-11-13 11:20:15,703][__main__][INFO] - Starting iteration 503.
[2025-11-13 11:20:15,707][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
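The VRAM records all share one template, which makes it easy to scrape memory headroom across a run (e.g. to confirm that "Apply reinforce step" consistently peaks near 26% of device VRAM). A throwaway parser; the tuple layout and regex are mine, inferred from the record text:

```python
import re

# Matches the trainer's VRAM accounting fields, e.g.
# "ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%"
VRAM = re.compile(
    r"ΔVRAM % \(total\): (?P<delta>[\d.]+)%, "
    r"Current % of VRAM taken: (?P<current>[\d.]+)%, "
    r"Block Peak % of device VRAM: (?P<peak>[\d.]+)%"
)

def vram_stats(record: str) -> tuple[float, float, float]:
    """Return (delta_pct, current_pct, block_peak_pct) from a VRAM record."""
    m = VRAM.search(record)
    if m is None:
        raise ValueError("not a VRAM record")
    return float(m["delta"]), float(m["current"]), float(m["peak"])

stats = vram_stats(
    "For task: Apply reinforce step, ΔVRAM % (total): 2.51%, "
    "Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11"
)
```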
[2025-11-13 11:20:15,707][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:20:25,177][__main__][INFO] - Number of regex retries in iteration 503: 0
[2025-11-13 11:20:25,178][__main__][INFO] - agents played in iteration 503 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:20:25,606][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:25,646][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:25,687][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:25,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:25,728][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:20:25,728][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:20:26,471][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:20:26,769][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:20:27,097][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:20:27,421][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:20:27,745][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:20:28,071][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:20:28,397][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:20:28,727][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:20:29,055][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:20:29,382][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:20:29,709][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:20:30,038][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:20:30,366][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:20:30,696][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:20:31,021][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:20:31,349][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:20:31,671][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:20:31,996][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:20:32,321][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:20:32,646][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:20:32,973][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:20:33,298][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:20:33,630][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:20:33,956][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:20:34,283][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:20:34,611][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:20:34,937][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:20:35,267][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:20:35,592][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:20:35,917][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:20:36,243][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:20:36,568][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:20:36,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:20:37,585][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:20:38,290][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:20:38,292][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:20:38,294][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:20:39,353][__main__][INFO] - Iteration 504 took 23s (40.05% Gen, 55.47% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 28m 32s. Estimated total time: 19h 42m 21s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 24s, 500 more iterations: 3h 17m 3s.
[2025-11-13 11:20:39,355][__main__][INFO] - Starting iteration 504.
[2025-11-13 11:20:39,359][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:20:39,359][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:20:48,380][__main__][INFO] - Number of regex retries in iteration 504: 0
[2025-11-13 11:20:48,381][__main__][INFO] - agents played in iteration 504 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:20:48,822][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:48,862][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:48,903][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:48,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:48,944][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:20:48,944][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:20:49,664][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:20:49,964][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:20:50,290][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:20:50,620][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:20:50,952][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:20:51,275][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:20:51,601][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:20:51,931][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:20:52,262][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:20:52,580][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:20:52,907][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:20:53,232][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:20:53,568][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:20:53,888][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:20:54,214][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:20:54,539][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:20:54,864][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:20:55,189][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:20:55,514][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:20:55,839][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:20:56,165][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:20:56,490][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:20:56,816][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:20:57,141][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:20:57,467][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:20:57,792][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:20:58,117][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:20:58,443][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:20:58,772][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:20:59,096][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:20:59,422][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:20:59,750][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:21:00,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
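Each training block above walks mini-batches 0 through 127, logs progress only every 4th one, and then reports a single accumulated total of 3840 policy-gradient tokens. Some sanity arithmetic on those counts (the even split per mini-batch is an assumption; the trainer only logs the total):

```python
# Counts taken from the log records above.
num_minibatches = 128
total_tokens = 3840

# Progress is logged for mini-batches 0, 4, 8, ..., 124.
logged = [i for i in range(num_minibatches) if i % 4 == 0]

# If tokens were spread evenly, each mini-batch would contribute 30.
tokens_per_minibatch = total_tokens / num_minibatches
```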
[2025-11-13 11:21:00,787][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:21:01,485][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:21:01,487][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:21:01,489][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:21:02,545][__main__][INFO] - Iteration 505 took 23s (38.91% Gen, 56.53% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 5m 10s. Estimated total time: 19h 19m 22s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 38s, 500 more iterations: 3h 13m 13s.
[2025-11-13 11:21:02,547][__main__][INFO] - Starting iteration 505.
[2025-11-13 11:21:02,551][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:21:02,551][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:21:11,676][__main__][INFO] - Number of regex retries in iteration 505: 0
[2025-11-13 11:21:11,676][__main__][INFO] - agents played in iteration 505 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:21:12,117][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:12,158][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:12,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:12,239][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:12,239][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:21:12,240][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:21:12,956][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:21:13,254][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:21:13,580][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:21:13,916][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:21:14,241][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:21:14,569][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:21:14,893][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:21:15,219][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:21:15,544][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:21:15,874][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:21:16,200][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:21:16,529][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:21:16,855][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:21:17,179][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:21:17,506][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:21:17,839][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:21:18,165][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:21:18,489][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:21:18,814][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:21:19,152][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:21:19,477][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:21:19,803][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:21:20,132][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:21:20,461][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:21:20,785][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:21:21,111][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:21:21,436][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:21:21,762][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:21:22,088][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:21:22,413][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:21:22,738][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:21:23,063][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:21:23,392][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:21:24,081][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:21:24,802][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:21:24,804][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:21:24,806][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:21:25,672][__main__][INFO] - Iteration 506 took 23s (39.46% Gen, 56.78% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 1m 31s. Estimated total time: 19h 16m 7s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 32s, 500 more iterations: 3h 12m 41s.
[2025-11-13 11:21:25,674][__main__][INFO] - Starting iteration 506.
[2025-11-13 11:21:25,677][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:21:25,678][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:21:35,176][__main__][INFO] - Number of regex retries in iteration 506: 0
[2025-11-13 11:21:35,177][__main__][INFO] - agents played in iteration 506 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:21:35,686][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:35,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:35,767][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:35,807][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:35,808][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:21:35,809][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:21:36,535][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:21:36,832][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:21:37,158][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:21:37,482][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:21:37,810][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:21:38,135][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:21:38,460][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:21:38,785][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:21:39,109][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:21:39,434][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:21:39,759][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:21:40,087][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:21:40,413][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:21:40,740][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:21:41,064][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:21:41,389][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:21:41,715][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:21:42,040][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:21:42,367][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:21:42,693][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:21:43,021][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:21:43,347][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:21:43,671][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:21:43,996][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:21:44,322][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:21:44,651][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:21:44,975][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:21:45,300][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:21:45,626][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:21:45,951][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:21:46,277][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:21:46,602][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:21:46,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:21:47,602][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:21:48,322][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:21:48,323][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:21:48,325][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:21:49,304][__main__][INFO] - Iteration 507 took 23s (40.20% Gen, 55.65% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 26m 22s. Estimated total time: 19h 41m 22s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 22s, 500 more iterations: 3h 16m 53s.
[2025-11-13 11:21:49,306][__main__][INFO] - Starting iteration 507.
[2025-11-13 11:21:49,309][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:21:49,310][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:21:58,218][__main__][INFO] - Number of regex retries in iteration 507: 0
[2025-11-13 11:21:58,218][__main__][INFO] - agents played in iteration 507 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:21:58,664][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:58,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:58,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:58,787][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:58,787][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:21:58,788][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:21:59,488][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:21:59,785][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:22:00,112][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:22:00,437][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:22:00,763][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:22:01,087][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:22:01,413][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:22:01,739][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:22:02,065][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:22:02,391][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:22:02,717][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:22:03,042][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:22:03,370][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:22:03,696][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:22:04,023][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:22:04,347][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:22:04,673][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:22:04,998][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:22:05,324][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:22:05,649][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:22:05,975][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:22:06,300][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:22:06,626][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:22:06,953][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:22:07,279][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:22:07,610][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:22:07,935][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:22:08,261][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:22:08,587][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:22:08,913][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:22:09,239][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:22:09,565][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:22:09,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:22:10,577][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:22:11,281][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:22:11,283][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:22:11,284][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:22:12,299][__main__][INFO] - Iteration 508 took 22s (38.75% Gen, 56.83% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 54m 11s. Estimated total time: 19h 9m 33s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 19s, 500 more iterations: 3h 11m 35s.
[2025-11-13 11:22:12,301][__main__][INFO] - Starting iteration 508.
[2025-11-13 11:22:12,305][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:22:12,305][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:22:20,830][__main__][INFO] - Number of regex retries in iteration 508: 0
[2025-11-13 11:22:20,830][__main__][INFO] - agents played in iteration 508 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:22:21,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:21,308][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:21,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:21,389][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:21,390][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:22:21,390][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:22:22,115][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:22:22,412][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:22:22,738][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:22:23,062][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:22:23,388][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:22:23,716][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:22:24,043][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:22:24,371][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:22:24,696][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:22:25,029][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:22:25,362][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:22:25,691][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:22:26,020][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:22:26,344][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:22:26,676][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:22:27,002][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:22:27,328][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:22:27,655][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:22:27,980][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:22:28,307][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:22:28,632][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:22:28,958][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:22:29,284][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:22:29,612][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:22:29,942][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:22:30,267][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:22:30,593][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:22:30,918][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:22:31,258][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:22:31,584][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:22:31,909][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:22:32,235][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:22:32,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:22:33,245][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:22:33,981][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:22:33,983][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:22:33,985][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:22:34,943][__main__][INFO] - Iteration 509 took 22s (37.66% Gen, 58.11% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 36m 11s. Estimated total time: 18h 51m 56s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 43s, 500 more iterations: 3h 8m 39s.
[2025-11-13 11:22:34,945][__main__][INFO] - Starting iteration 509.
[2025-11-13 11:22:34,949][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:22:34,949][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:22:44,032][__main__][INFO] - Number of regex retries in iteration 509: 0
[2025-11-13 11:22:44,033][__main__][INFO] - agents played in iteration 509 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:22:44,484][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:44,526][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:44,567][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:44,608][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:44,609][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:22:44,610][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:22:45,329][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:22:45,626][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:22:45,953][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:22:46,282][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:22:46,607][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:22:46,935][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:22:47,265][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:22:47,594][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:22:47,918][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:22:48,243][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:22:48,567][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:22:48,894][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:22:49,219][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:22:49,548][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:22:49,883][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:22:50,214][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:22:50,544][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:22:50,872][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:22:51,195][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:22:51,521][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:22:51,846][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:22:52,171][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:22:52,498][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:22:52,822][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:22:53,147][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:22:53,477][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:22:53,799][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:22:54,124][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:22:54,451][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:22:54,776][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:22:55,104][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:22:55,429][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:22:55,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:22:56,463][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:22:57,162][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:22:57,164][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:22:57,165][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:22:58,079][__main__][INFO] - Iteration 510 took 23s (39.27% Gen, 56.77% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 0m 27s. Estimated total time: 19h 16m 35s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 33s, 500 more iterations: 3h 12m 45s.
[2025-11-13 11:22:58,081][__main__][INFO] - Starting iteration 510.
[2025-11-13 11:22:58,085][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:22:58,086][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:23:06,564][__main__][INFO] - Number of regex retries in iteration 510: 0
[2025-11-13 11:23:06,564][__main__][INFO] - agents played in iteration 510 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:23:07,005][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:07,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:07,086][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:07,126][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:07,127][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:23:07,128][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:23:07,848][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:23:08,153][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:23:08,479][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:23:08,806][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:23:09,132][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:23:09,465][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:23:09,790][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:23:10,122][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:23:10,447][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:23:10,785][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:23:11,110][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:23:11,439][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:23:11,767][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:23:12,096][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:23:12,423][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:23:12,751][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:23:13,079][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:23:13,410][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:23:13,736][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:23:14,062][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:23:14,393][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:23:14,713][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:23:15,038][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:23:15,363][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:23:15,689][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:23:16,014][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:23:16,339][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:23:16,666][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:23:16,991][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:23:17,318][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:23:17,644][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:23:17,969][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:23:18,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:23:18,988][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:23:19,718][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:23:19,719][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:23:19,721][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:23:21,552][__main__][INFO] - Iteration 511 took 23s (36.13% Gen, 56.06% Train). Generation: 8s, Training: 13s. Estimated remaining time: 19h 16m 50s. Estimated total time: 19h 33m 22s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 6s, 500 more iterations: 3h 15m 33s.
[2025-11-13 11:23:21,554][__main__][INFO] - Starting iteration 511.
[2025-11-13 11:23:21,558][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:23:21,558][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:23:30,979][__main__][INFO] - Number of regex retries in iteration 511: 0
[2025-11-13 11:23:30,980][__main__][INFO] - agents played in iteration 511 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:23:31,411][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:31,453][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:31,509][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:31,550][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:31,551][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:23:31,551][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:23:32,269][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:23:32,567][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:23:32,893][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:23:33,219][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:23:33,555][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:23:33,887][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:23:34,220][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:23:34,545][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:23:34,872][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:23:35,200][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:23:35,528][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:23:35,854][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:23:36,182][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:23:36,511][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:23:36,839][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:23:37,171][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:23:37,499][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:23:37,827][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:23:38,159][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:23:38,483][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:23:38,808][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:23:39,133][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:23:39,459][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:23:39,791][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:23:40,117][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:23:40,443][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:23:40,768][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:23:41,095][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:23:41,420][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:23:41,745][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:23:42,071][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:23:42,396][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:23:42,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:23:43,417][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:23:44,123][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:23:44,125][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:23:44,126][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:23:45,067][__main__][INFO] - Iteration 512 took 23s (40.08% Gen, 55.91% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 18m 37s. Estimated total time: 19h 35m 32s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 11s, 500 more iterations: 3h 15m 55s.
[2025-11-13 11:23:45,070][__main__][INFO] - Starting iteration 512.
[2025-11-13 11:23:45,073][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:23:45,074][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:23:54,102][__main__][INFO] - Number of regex retries in iteration 512: 0
[2025-11-13 11:23:54,103][__main__][INFO] - agents played in iteration 512 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:23:54,534][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:54,574][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:54,616][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:54,657][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:54,658][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:23:54,658][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:23:55,377][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:23:55,675][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:23:56,001][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:23:56,327][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:23:56,652][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:23:56,988][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:23:57,316][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:23:57,640][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:23:57,965][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:23:58,300][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:23:58,628][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:23:58,955][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:23:59,281][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:23:59,608][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:23:59,935][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:24:00,259][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:24:00,588][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:24:00,913][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:24:01,247][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:24:01,575][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:24:01,910][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:24:02,228][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:24:02,554][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:24:02,879][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:24:03,209][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:24:03,530][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:24:03,856][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:24:04,180][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:24:04,510][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:24:04,832][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:24:05,160][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:24:05,485][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:24:05,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:24:06,482][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:24:07,205][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:24:07,206][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:24:07,209][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:24:08,063][__main__][INFO] - Iteration 513 took 22s (39.27% Gen, 57.00% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 52m 14s. Estimated total time: 19h 9m 33s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 19s, 500 more iterations: 3h 11m 35s.
[2025-11-13 11:24:08,065][__main__][INFO] - Starting iteration 513.
[2025-11-13 11:24:08,069][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:24:08,069][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:24:16,355][__main__][INFO] - Number of regex retries in iteration 513: 0
[2025-11-13 11:24:16,356][__main__][INFO] - agents played in iteration 513 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:24:16,801][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:24:16,841][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:24:16,883][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:24:16,923][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:24:16,924][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:24:16,924][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:24:17,647][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:24:17,944][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:24:18,270][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:24:18,594][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:24:18,920][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:24:19,244][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:24:19,569][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:24:19,901][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:24:20,225][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:24:20,551][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:24:20,878][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:24:21,205][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:24:21,534][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:24:21,861][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:24:22,189][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:24:22,517][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:24:22,849][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:24:23,174][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:24:23,509][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:24:23,834][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:24:24,159][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:24:24,484][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:24:24,810][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:24:25,135][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:24:25,460][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:24:25,784][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:24:26,110][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:24:26,435][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:24:26,762][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:24:27,087][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:24:27,412][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:24:27,737][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:24:28,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:24:28,763][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:24:29,474][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:24:29,475][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:24:29,477][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:24:30,368][__main__][INFO] - Iteration 514 took 22s (37.16% Gen, 58.84% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 17m 22s. Estimated total time: 18h 35m 2s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 10s, 500 more iterations: 3h 5m 50s.
[2025-11-13 11:24:30,371][__main__][INFO] - Starting iteration 514.
[2025-11-13 11:24:30,374][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:24:30,375][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:24:39,047][__main__][INFO] - Number of regex retries in iteration 514: 0
[2025-11-13 11:24:39,048][__main__][INFO] - agents played in iteration 514 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:24:39,479][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:24:39,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:24:39,559][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:24:39,601][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:24:39,601][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:24:39,602][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:24:40,345][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:24:40,641][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:24:40,967][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:24:41,292][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:24:41,618][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:24:41,943][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:24:42,268][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:24:42,594][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:24:42,923][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:24:43,243][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:24:43,568][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:24:43,893][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:24:44,227][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:24:44,556][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:24:44,884][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:24:45,210][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:24:45,544][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:24:45,864][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:24:46,191][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:24:46,519][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:24:46,844][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:24:47,171][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:24:47,496][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:24:47,821][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:24:48,148][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:24:48,474][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:24:48,799][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:24:49,125][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:24:49,449][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:24:49,774][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:24:50,100][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:24:50,426][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:24:50,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:24:51,453][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:24:52,173][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:24:52,174][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:24:52,176][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:24:53,109][__main__][INFO] - Iteration 515 took 22s (38.15% Gen, 57.74% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 38m 45s. Estimated total time: 18h 56m 48s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 53s, 500 more iterations: 3h 9m 28s.
[2025-11-13 11:24:53,111][__main__][INFO] - Starting iteration 515.
[2025-11-13 11:24:53,115][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:24:53,115][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:25:01,709][__main__][INFO] - Number of regex retries in iteration 515: 0
[2025-11-13 11:25:01,710][__main__][INFO] - agents played in iteration 515 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:25:02,141][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:25:02,181][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:25:02,222][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:25:02,263][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:25:02,264][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:25:02,264][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:25:02,990][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:25:03,287][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:25:03,612][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:25:03,937][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:25:04,262][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:25:04,588][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:25:04,912][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:25:05,237][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:25:05,561][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:25:05,885][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:25:06,210][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:25:06,537][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:25:06,864][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:25:07,189][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:25:07,520][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:25:07,845][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:25:08,176][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:25:08,505][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:25:08,835][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:25:09,161][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:25:09,487][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:25:09,813][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:25:10,138][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:25:10,463][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:25:10,788][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:25:11,126][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:25:11,451][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:25:11,776][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:25:12,104][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:25:12,432][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:25:12,759][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:25:13,085][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:25:13,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:25:14,099][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:25:14,820][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:25:14,821][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:25:14,823][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:25:15,898][__main__][INFO] - Iteration 516 took 22s (37.72% Gen, 57.56% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 40m 45s. Estimated total time: 18h 59m 11s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 58s, 500 more iterations: 3h 9m 51s.
[2025-11-13 11:25:15,900][__main__][INFO] - Starting iteration 516.
[2025-11-13 11:25:15,903][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:25:15,904][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:25:24,081][__main__][INFO] - Number of regex retries in iteration 516: 0 [2025-11-13 11:25:24,082][__main__][INFO] - agents played in iteration 516 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:25:24,525][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:24,566][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:24,607][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:24,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:24,649][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:25:24,649][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:25:25,381][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:25:25,678][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:25:26,004][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:25:26,330][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:25:26,655][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:25:26,980][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:25:27,305][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:25:27,629][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:25:27,959][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:25:28,279][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:25:28,604][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:25:28,934][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:25:29,261][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:25:29,587][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:25:29,915][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:25:30,240][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:25:30,564][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:25:30,890][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:25:31,223][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:25:31,549][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:25:31,882][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:25:32,207][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:25:32,536][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:25:32,860][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:25:33,188][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:25:33,515][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:25:33,841][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:25:34,166][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:25:34,491][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:25:34,817][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:25:35,144][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:25:35,470][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:25:35,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:25:36,485][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:25:37,207][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:25:37,209][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:25:37,211][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:25:38,344][__main__][INFO] - Iteration 517 took 22s (36.44% Gen, 58.50% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 23m 18s. Estimated total time: 18h 42m 6s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 24s, 500 more iterations: 3h 7m 1s. [2025-11-13 11:25:38,347][__main__][INFO] - Starting iteration 517. [2025-11-13 11:25:38,351][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. 
[2025-11-13 11:25:38,352][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:25:46,975][__main__][INFO] - Number of regex retries in iteration 517: 0 [2025-11-13 11:25:46,976][__main__][INFO] - agents played in iteration 517 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:25:47,409][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:47,449][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:47,490][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:47,531][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:47,531][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:25:47,532][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:25:48,255][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:25:48,552][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:25:48,879][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:25:49,204][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:25:49,529][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:25:49,856][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:25:50,179][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:25:50,504][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:25:50,829][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:25:51,154][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:25:51,480][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:25:51,805][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:25:52,130][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:25:52,455][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:25:52,782][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:25:53,110][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:25:53,435][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:25:53,764][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:25:54,092][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:25:54,423][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:25:54,748][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:25:55,075][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:25:55,401][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:25:55,727][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:25:56,053][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:25:56,380][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:25:56,710][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:25:57,042][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:25:57,363][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:25:57,688][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:25:58,013][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:25:58,338][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:25:58,663][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:25:59,375][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:26:00,107][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:26:00,109][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:26:00,110][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:26:01,037][__main__][INFO] - Iteration 518 took 22s (38.01% Gen, 57.90% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 35m 9s. Estimated total time: 18h 54m 20s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 48s, 500 more iterations: 3h 9m 3s. [2025-11-13 11:26:01,040][__main__][INFO] - Starting iteration 518. [2025-11-13 11:26:01,045][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. 
[2025-11-13 11:26:01,046][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:26:10,267][__main__][INFO] - Number of regex retries in iteration 518: 0 [2025-11-13 11:26:10,267][__main__][INFO] - agents played in iteration 518 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:26:10,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:10,741][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:10,782][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:10,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:10,823][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:26:10,824][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:26:11,555][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:26:11,852][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:26:12,178][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:26:12,502][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:26:12,827][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:26:13,151][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:26:13,476][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:26:13,801][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:26:14,127][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:26:14,451][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:26:14,776][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:26:15,102][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:26:15,437][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:26:15,758][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:26:16,085][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:26:16,413][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:26:16,747][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:26:17,066][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:26:17,392][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:26:17,718][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:26:18,042][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:26:18,371][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:26:18,697][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:26:19,023][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:26:19,350][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:26:19,679][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:26:20,004][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:26:20,329][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:26:20,655][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:26:20,985][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:26:21,311][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:26:21,640][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:26:21,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:26:22,658][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:26:23,389][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:26:23,391][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:26:23,393][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:26:24,263][__main__][INFO] - Iteration 519 took 23s (39.71% Gen, 56.52% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 1m 25s. Estimated total time: 19h 21m 0s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 42s, 500 more iterations: 3h 13m 30s. [2025-11-13 11:26:24,265][__main__][INFO] - Starting iteration 519. [2025-11-13 11:26:24,268][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. 
[2025-11-13 11:26:24,269][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:26:32,887][__main__][INFO] - Number of regex retries in iteration 519: 0 [2025-11-13 11:26:32,887][__main__][INFO] - agents played in iteration 519 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:26:33,334][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:33,375][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:33,416][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:33,458][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:33,458][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:26:33,459][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:26:34,195][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:26:34,493][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:26:34,819][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:26:35,144][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:26:35,468][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:26:35,793][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:26:36,118][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:26:36,445][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:26:36,769][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:26:37,097][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:26:37,423][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:26:37,747][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:26:38,074][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:26:38,399][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:26:38,724][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:26:39,050][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:26:39,376][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:26:39,701][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:26:40,027][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:26:40,358][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:26:40,684][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:26:41,009][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:26:41,334][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:26:41,660][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:26:41,988][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:26:42,316][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:26:42,645][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:26:42,970][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:26:43,295][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:26:43,620][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:26:43,948][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:26:44,273][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:26:44,599][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:26:45,273][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:26:45,987][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:26:45,988][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:26:45,990][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:26:47,034][__main__][INFO] - Iteration 520 took 22s (37.86% Gen, 57.55% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 38m 22s. Estimated total time: 18h 58m 19s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 56s, 500 more iterations: 3h 9m 43s. [2025-11-13 11:26:47,036][__main__][INFO] - Starting iteration 520. [2025-11-13 11:26:47,039][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. 
[2025-11-13 11:26:47,040][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:26:55,417][__main__][INFO] - Number of regex retries in iteration 520: 0 [2025-11-13 11:26:55,417][__main__][INFO] - agents played in iteration 520 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:26:55,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:55,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:55,932][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:55,973][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:55,973][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:26:55,974][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:26:56,717][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:26:57,015][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:26:57,341][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:26:57,665][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:26:57,990][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:26:58,317][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:26:58,642][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:26:58,967][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:26:59,292][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:26:59,618][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:26:59,942][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:27:00,267][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:27:00,595][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:27:00,922][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:27:01,248][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:27:01,573][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:27:01,902][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:27:02,235][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:27:02,564][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:27:02,890][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:27:03,227][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:27:03,553][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:27:03,879][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:27:04,206][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:27:04,531][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:27:04,857][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:27:05,184][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:27:05,512][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:27:05,842][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:27:06,168][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:27:06,494][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:27:06,821][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:27:07,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:27:07,840][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:27:08,591][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:27:08,592][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:27:08,594][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:27:10,336][__main__][INFO] - Iteration 521 took 23s (35.96% Gen, 56.56% Train). Generation: 8s, Training: 13s. Estimated remaining time: 19h 4m 33s. Estimated total time: 19h 24m 54s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 49s, 500 more iterations: 3h 14m 9s. [2025-11-13 11:27:10,338][__main__][INFO] - Starting iteration 521. [2025-11-13 11:27:10,342][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1. 
[2025-11-13 11:27:10,342][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:27:18,600][__main__][INFO] - Number of regex retries in iteration 521: 0 [2025-11-13 11:27:18,600][__main__][INFO] - agents played in iteration 521 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:27:19,055][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:27:19,095][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:27:19,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:27:19,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:27:19,178][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:27:19,179][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:27:19,907][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:27:20,214][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:27:20,540][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:27:20,865][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:27:21,191][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:27:21,516][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:27:21,841][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:27:22,166][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:27:22,491][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:27:22,816][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:27:23,141][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:27:23,467][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:27:23,793][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:27:24,119][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:27:24,444][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:27:24,770][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:27:25,097][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:27:25,428][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:27:25,756][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:27:26,082][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:27:26,407][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:27:26,734][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:27:27,060][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:27:27,387][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:27:27,717][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:27:28,047][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:27:28,374][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:27:28,700][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:27:29,026][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:27:29,356][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:27:29,682][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:27:30,008][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:27:30,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
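The block above shows an accumulate-then-step pattern: 128 mini-batches are processed, per-token policy-gradient losses are accumulated (3840 tokens, i.e. 30 tokens per mini-batch), and only afterwards is a single optimizer step applied. A minimal, framework-free sketch of that pattern — the toy model, loss, and all names here are illustrative assumptions, not the repository's actual trainer code:

```python
# Hedged sketch of gradient accumulation as suggested by the log above.
# The linear model and squared loss are stand-ins for illustration only.

def accumulate_and_step(minibatches, w, lr=0.01):
    """Accumulate gradients over all mini-batches, then apply ONE step."""
    grad_sum = 0.0
    token_count = 0
    for batch in minibatches:          # e.g. 128 mini-batches per iteration
        for x, target in batch:        # each sample plays the role of a token
            # d/dw of 0.5 * (w*x - target)^2
            grad_sum += (w * x - target) * x
            token_count += 1
    # Single "apply step" using the token-averaged accumulated gradient.
    w -= lr * grad_sum / token_count
    return w, token_count

# 128 mini-batches of 30 samples -> 3840 "tokens", matching the log's count.
data = [[(1.0, 2.0)] * 30 for _ in range(128)]
w, n = accumulate_and_step(data, w=0.0)
```

Normalizing by the accumulated token count mirrors the "Accumulated the policy gradient loss for 3840 tokens" bookkeeping, though the real trainer may normalize differently.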
[2025-11-13 11:27:31,039][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:27:31,759][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:27:31,760][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:27:31,762][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:27:32,678][__main__][INFO] - Iteration 522 took 22s (36.97% Gen, 58.93% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 16m 8s. Estimated total time: 18h 36m 51s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 13s, 500 more iterations: 3h 6m 8s.
[2025-11-13 11:27:32,680][__main__][INFO] - Starting iteration 522.
[2025-11-13 11:27:32,683][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
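The per-iteration summary lines ("Iteration 522 took 22s (36.97% Gen, 58.93% Train) ... Estimated remaining time ...") follow from simple timing bookkeeping. A minimal sketch, assuming the estimate is just the measured per-iteration time scaled by the remaining iteration count (the actual implementation may smooth over several iterations); all values and names below are hypothetical:

```python
import datetime

def iteration_summary(gen_s, train_s, total_s, done, total_iters):
    """Phase shares of the iteration plus a naive remaining-time estimate."""
    gen_pct = 100.0 * gen_s / total_s      # share spent generating rollouts
    train_pct = 100.0 * train_s / total_s  # share spent in the train step
    remaining = datetime.timedelta(seconds=total_s * (total_iters - done))
    return gen_pct, train_pct, remaining

# Hypothetical numbers in the same shape as the log's summary line.
gen_pct, train_pct, eta = iteration_summary(
    gen_s=8.0, train_s=13.0, total_s=22.0, done=522, total_iters=1000
)
```

Note the shares need not sum to 100%: the residual covers per-iteration overhead such as advantage exchange and checkpointing.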
[2025-11-13 11:27:32,684][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:27:41,302][__main__][INFO] - Number of regex retries in iteration 522: 0
[2025-11-13 11:27:41,303][__main__][INFO] - agents played in iteration 522 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:27:41,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:41,791][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:41,832][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:41,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:41,874][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:27:41,874][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:27:42,630][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:27:42,930][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:27:43,257][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:27:43,583][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:27:43,907][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:27:44,233][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:27:44,558][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:27:44,884][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:27:45,216][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:27:45,541][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:27:45,866][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:27:46,192][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:27:46,520][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:27:46,846][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:27:47,172][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:27:47,498][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:27:47,823][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:27:48,150][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:27:48,476][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:27:48,802][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:27:49,141][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:27:49,467][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:27:49,793][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:27:50,119][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:27:50,445][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:27:50,771][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:27:51,096][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:27:51,422][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:27:51,747][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:27:52,073][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:27:52,399][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:27:52,726][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:27:53,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:27:53,759][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:27:54,488][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:27:54,489][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:27:54,491][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:27:55,375][__main__][INFO] - Iteration 523 took 22s (37.98% Gen, 58.12% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 33m 31s. Estimated total time: 18h 54m 37s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 49s, 500 more iterations: 3h 9m 6s.
[2025-11-13 11:27:55,377][__main__][INFO] - Starting iteration 523.
[2025-11-13 11:27:55,380][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:27:55,381][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:28:04,097][__main__][INFO] - Number of regex retries in iteration 523: 0
[2025-11-13 11:28:04,098][__main__][INFO] - agents played in iteration 523 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:28:04,538][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:04,579][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:04,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:04,660][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:04,661][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:28:04,661][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:28:05,383][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:28:05,681][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:28:06,012][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:28:06,338][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:28:06,666][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:28:06,991][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:28:07,322][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:28:07,647][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:28:07,972][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:28:08,297][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:28:08,623][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:28:08,948][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:28:09,272][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:28:09,598][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:28:09,935][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:28:10,261][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:28:10,586][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:28:10,911][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:28:11,238][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:28:11,564][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:28:11,890][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:28:12,217][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:28:12,552][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:28:12,877][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:28:13,203][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:28:13,529][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:28:13,857][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:28:14,182][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:28:14,511][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:28:14,838][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:28:15,164][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:28:15,490][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:28:15,819][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:28:16,525][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:28:17,232][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:28:17,234][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:28:17,236][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:28:18,179][__main__][INFO] - Iteration 524 took 22s (38.23% Gen, 57.62% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 38m 30s. Estimated total time: 18h 59m 58s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 59s, 500 more iterations: 3h 9m 59s.
[2025-11-13 11:28:18,181][__main__][INFO] - Starting iteration 524.
[2025-11-13 11:28:18,184][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:28:18,185][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:28:27,128][__main__][INFO] - Number of regex retries in iteration 524: 0
[2025-11-13 11:28:27,129][__main__][INFO] - agents played in iteration 524 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:28:27,550][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:27,592][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:27,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:27,674][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:27,674][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:28:27,674][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:28:28,406][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:28:28,704][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:28:29,031][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:28:29,364][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:28:29,685][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:28:30,010][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:28:30,336][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:28:30,665][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:28:30,991][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:28:31,316][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:28:31,642][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:28:31,968][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:28:32,293][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:28:32,619][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:28:32,944][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:28:33,271][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:28:33,596][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:28:33,922][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:28:34,249][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:28:34,577][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:28:34,904][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:28:35,229][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:28:35,555][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:28:35,881][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:28:36,214][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:28:36,540][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:28:36,866][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:28:37,193][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:28:37,522][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:28:37,850][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:28:38,177][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:28:38,504][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:28:38,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:28:39,552][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:28:40,291][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:28:40,293][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:28:40,295][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:28:41,176][__main__][INFO] - Iteration 525 took 22s (38.90% Gen, 57.27% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 47m 45s. Estimated total time: 19h 9m 37s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 19s, 500 more iterations: 3h 11m 36s.
[2025-11-13 11:28:41,178][__main__][INFO] - Starting iteration 525.
[2025-11-13 11:28:41,181][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:28:41,182][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:28:49,556][__main__][INFO] - Number of regex retries in iteration 525: 0
[2025-11-13 11:28:49,556][__main__][INFO] - agents played in iteration 525 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:28:49,990][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:50,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:50,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:50,113][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:50,114][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:28:50,114][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:28:50,839][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:28:51,136][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:28:51,464][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:28:51,792][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:28:52,118][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:28:52,446][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:28:52,772][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:28:53,098][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:28:53,424][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:28:53,749][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:28:54,074][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:28:54,400][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:28:54,726][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:28:55,051][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:28:55,378][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:28:55,704][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:28:56,029][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:28:56,355][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:28:56,681][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:28:57,007][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:28:57,333][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:28:57,659][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:28:57,984][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:28:58,310][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:28:58,637][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:28:58,963][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:28:59,289][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:28:59,616][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:28:59,942][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:29:00,270][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:29:00,598][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:29:00,925][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:29:01,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:29:01,957][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:29:02,693][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:29:02,695][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:29:02,696][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:29:03,737][__main__][INFO] - Iteration 526 took 22s (37.13% Gen, 58.25% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 25m 36s. Estimated total time: 18h 47m 50s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 35s, 500 more iterations: 3h 7m 58s.
[2025-11-13 11:29:03,739][__main__][INFO] - Starting iteration 526.
[2025-11-13 11:29:03,743][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:29:03,743][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:29:12,086][__main__][INFO] - Number of regex retries in iteration 526: 0
[2025-11-13 11:29:12,086][__main__][INFO] - agents played in iteration 526 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:29:12,518][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:12,571][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:12,617][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:12,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:12,668][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:29:12,669][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:29:13,371][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:29:13,668][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:29:13,995][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:29:14,327][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:29:14,653][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:29:14,979][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:29:15,305][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:29:15,631][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:29:15,958][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:29:16,283][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:29:16,612][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:29:16,946][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:29:17,271][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:29:17,597][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:29:17,925][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:29:18,256][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:29:18,582][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:29:18,907][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:29:19,234][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:29:19,571][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:29:19,898][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:29:20,224][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:29:20,549][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:29:20,878][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:29:21,203][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:29:21,529][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:29:21,856][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:29:22,181][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:29:22,506][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:29:22,833][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:29:23,164][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:29:23,486][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:29:23,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:29:24,538][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:29:25,223][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:29:25,224][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:29:25,226][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:29:26,144][__main__][INFO] - Iteration 527 took 22s (37.24% Gen, 58.66% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 17m 28s. Estimated total time: 18h 40m 5s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 20s, 500 more iterations: 3h 6m 40s.
[2025-11-13 11:29:26,146][__main__][INFO] - Starting iteration 527.
[2025-11-13 11:29:26,149][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:29:26,149][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:29:34,870][__main__][INFO] - Number of regex retries in iteration 527: 0 [2025-11-13 11:29:34,870][__main__][INFO] - agents played in iteration 527 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:29:35,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:29:35,338][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:29:35,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:29:35,418][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:29:35,418][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:29:35,419][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:29:36,127][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:29:36,747][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:29:37,047][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:29:37,375][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:29:37,702][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:29:38,029][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:29:38,356][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:29:38,683][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:29:39,007][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:29:39,334][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:29:39,661][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:29:39,988][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:29:40,316][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:29:40,650][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:29:40,976][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:29:41,302][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:29:41,628][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:29:41,953][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:29:42,280][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:29:42,606][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:29:42,932][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:29:43,258][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:29:43,583][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:29:43,909][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:29:44,235][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:29:44,575][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:29:44,901][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:29:45,226][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:29:45,552][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:29:45,881][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:29:46,207][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:29:46,533][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:29:46,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:29:47,568][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:29:48,293][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:29:48,295][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:29:48,298][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:29:49,380][__main__][INFO] - Iteration 528 took 23s (37.54% Gen, 57.80% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 58m 37s. Estimated total time: 19h 21m 37s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 43s, 500 more iterations: 3h 13m 36s.
[2025-11-13 11:29:49,382][__main__][INFO] - Starting iteration 528.
[2025-11-13 11:29:49,386][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:29:49,386][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:29:57,514][__main__][INFO] - Number of regex retries in iteration 528: 0
[2025-11-13 11:29:57,514][__main__][INFO] - agents played in iteration 528 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:29:57,927][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:57,967][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:58,007][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:58,046][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:58,047][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:29:58,047][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:29:58,759][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:29:59,056][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:29:59,383][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:29:59,708][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:30:00,033][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:30:00,359][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:30:00,685][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:30:01,011][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:30:01,337][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:30:01,672][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:30:02,000][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:30:02,326][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:30:02,652][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:30:02,980][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:30:03,307][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:30:03,633][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:30:03,961][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:30:04,289][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:30:04,616][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:30:04,941][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:30:05,268][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:30:05,593][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:30:05,919][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:30:06,245][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:30:06,580][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:30:06,896][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:30:07,222][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:30:07,547][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:30:07,880][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:30:08,199][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:30:08,525][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:30:08,851][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:30:09,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:30:09,900][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:30:10,622][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:30:10,623][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:30:10,625][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:30:11,528][__main__][INFO] - Iteration 529 took 22s (36.70% Gen, 59.21% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 3m 48s. Estimated total time: 18h 27m 10s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 54s, 500 more iterations: 3h 4m 31s.
[2025-11-13 11:30:11,530][__main__][INFO] - Starting iteration 529.
[2025-11-13 11:30:11,534][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:30:11,534][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:30:20,224][__main__][INFO] - Number of regex retries in iteration 529: 0
[2025-11-13 11:30:20,225][__main__][INFO] - agents played in iteration 529 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:30:20,637][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:20,679][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:20,718][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:20,758][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:20,758][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:30:20,758][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:30:21,466][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:30:21,763][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:30:22,089][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:30:22,414][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:30:22,740][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:30:23,065][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:30:23,392][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:30:23,718][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:30:24,044][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:30:24,369][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:30:24,695][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:30:25,021][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:30:25,353][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:30:25,681][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:30:26,008][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:30:26,333][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:30:26,663][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:30:26,995][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:30:27,322][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:30:27,648][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:30:27,975][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:30:28,302][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:30:28,627][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:30:28,954][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:30:29,279][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:30:29,606][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:30:29,932][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:30:30,259][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:30:30,586][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:30:30,912][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:30:31,239][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:30:31,565][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:30:31,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:30:32,607][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:30:33,303][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:30:33,305][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:30:33,307][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:30:34,125][__main__][INFO] - Iteration 530 took 22s (38.47% Gen, 57.91% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 25m 52s. Estimated total time: 18h 49m 36s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 39s, 500 more iterations: 3h 8m 16s.
[2025-11-13 11:30:34,127][__main__][INFO] - Starting iteration 530.
[2025-11-13 11:30:34,131][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:30:34,131][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:30:42,667][__main__][INFO] - Number of regex retries in iteration 530: 0
[2025-11-13 11:30:42,668][__main__][INFO] - agents played in iteration 530 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:30:43,083][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:43,123][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:43,163][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:43,202][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:43,203][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:30:43,203][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:30:43,905][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:30:44,203][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:30:44,529][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:30:44,859][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:30:45,192][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:30:45,519][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:30:45,847][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:30:46,179][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:30:46,503][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:30:46,836][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:30:47,169][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:30:47,496][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:30:47,829][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:30:48,162][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:30:48,487][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:30:48,813][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:30:49,138][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:30:49,464][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:30:49,790][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:30:50,116][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:30:50,443][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:30:50,769][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:30:51,095][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:30:51,421][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:30:51,747][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:30:52,074][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:30:52,400][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:30:52,725][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:30:53,051][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:30:53,377][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:30:53,702][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:30:54,027][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:30:54,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:30:55,050][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:30:55,772][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:30:55,774][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:30:55,776][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:30:57,484][__main__][INFO] - Iteration 531 took 23s (36.55% Gen, 56.13% Train). Generation: 8s, Training: 13s. Estimated remaining time: 19h 3m 35s. Estimated total time: 19h 27m 42s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 55s, 500 more iterations: 3h 14m 37s.
[2025-11-13 11:30:57,486][__main__][INFO] - Starting iteration 531.
[2025-11-13 11:30:57,489][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1.
[2025-11-13 11:30:57,490][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:31:01,138][mllm.models.large_language_model_local][WARNING] - Response %A did not match regex: (|), retry 1/1
[2025-11-13 11:31:06,987][__main__][INFO] - Number of regex retries in iteration 531: 1
[2025-11-13 11:31:06,988][__main__][INFO] - agents played in iteration 531 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:31:07,434][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:07,467][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:07,500][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:07,532][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:07,533][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:31:07,533][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:31:08,232][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:31:08,527][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:31:08,853][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:31:09,176][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:31:09,497][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:31:09,823][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:31:10,145][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:31:10,469][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:31:10,795][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:31:11,117][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:31:11,439][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:31:11,762][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:31:12,085][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:31:12,407][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:31:12,731][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:31:13,054][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:31:13,378][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:31:13,708][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:31:14,025][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:31:14,347][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:31:14,673][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:31:14,998][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:31:15,322][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:31:15,645][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:31:15,968][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:31:16,292][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:31:16,615][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:31:16,937][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:31:17,260][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:31:17,583][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:31:17,907][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:31:18,233][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:31:18,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:31:19,288][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:31:19,970][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:31:19,972][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:31:19,973][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:31:20,833][__main__][INFO] - Iteration 532 took 23s (40.68% Gen, 55.62% Train). Generation: 9s, Training: 12s. Estimated remaining time: 19h 2m 44s. Estimated total time: 19h 27m 15s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 54s, 500 more iterations: 3h 14m 32s.
[2025-11-13 11:31:20,835][__main__][INFO] - Starting iteration 532.
[2025-11-13 11:31:20,838][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1.
[2025-11-13 11:31:20,838][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:31:29,941][__main__][INFO] - Number of regex retries in iteration 532: 0
[2025-11-13 11:31:29,941][__main__][INFO] - agents played in iteration 532 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:31:30,382][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:30,416][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:30,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:30,481][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:30,482][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:31:30,482][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:31:31,182][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:31:31,477][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:31:31,801][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:31:32,128][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:31:32,449][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:31:32,772][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:31:33,096][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:31:33,421][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:31:33,741][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:31:34,064][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:31:34,386][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:31:34,714][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:31:35,031][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:31:35,353][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:31:35,676][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:31:36,008][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:31:36,323][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:31:36,647][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:31:36,969][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:31:37,302][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:31:37,618][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:31:37,941][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:31:38,265][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:31:38,588][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:31:38,917][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:31:39,242][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:31:39,568][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:31:39,891][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:31:40,220][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:31:40,543][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:31:40,866][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:31:41,189][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:31:41,516][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:31:42,271][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:31:42,973][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:31:42,974][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:31:42,976][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:31:43,870][__main__][INFO] - Iteration 533 took 23s (39.52% Gen, 56.59% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 46m 46s. Estimated total time: 19h 11m 40s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 23s, 500 more iterations: 3h 11m 56s. [2025-11-13 11:31:43,872][__main__][INFO] - Starting iteration 533. [2025-11-13 11:31:43,875][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 11:31:43,876][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:31:52,925][__main__][INFO] - Number of regex retries in iteration 533: 0 [2025-11-13 11:31:52,926][__main__][INFO] - agents played in iteration 533 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:31:53,349][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:31:53,386][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:31:53,421][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:31:53,456][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:31:53,456][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:31:53,457][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:31:54,156][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:31:54,451][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:31:54,774][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:31:55,096][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:31:55,420][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:31:55,745][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:31:56,073][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:31:56,396][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:31:56,719][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:31:57,041][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:31:57,366][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:31:57,688][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:31:58,011][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:31:58,334][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:31:58,657][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:31:58,984][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:31:59,308][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:31:59,633][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:31:59,955][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:32:00,280][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:32:00,608][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:32:00,937][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:32:01,268][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:32:01,591][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:32:01,916][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:32:02,244][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:32:02,570][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:32:02,899][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:32:03,223][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:32:03,547][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:32:03,870][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:32:04,192][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:32:04,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:32:05,268][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:32:05,943][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:32:05,944][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:32:05,945][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:32:06,823][__main__][INFO] - Iteration 534 took 22s (39.43% Gen, 56.73% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 42m 11s. Estimated total time: 19h 7m 28s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 14s, 500 more iterations: 3h 11m 14s. [2025-11-13 11:32:06,825][__main__][INFO] - Starting iteration 534. [2025-11-13 11:32:06,828][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 11:32:06,828][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:32:16,524][__main__][INFO] - Number of regex retries in iteration 534: 0 [2025-11-13 11:32:16,524][__main__][INFO] - agents played in iteration 534 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:32:16,964][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:16,998][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:17,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:17,065][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:17,066][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:32:17,066][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:32:17,754][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:32:18,048][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:32:18,372][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:32:18,694][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:32:19,017][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:32:19,342][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:32:19,669][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:32:19,997][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:32:20,320][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:32:20,643][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:32:20,966][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:32:21,297][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:32:21,621][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:32:21,943][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:32:22,266][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:32:22,589][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:32:22,911][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:32:23,234][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:32:23,559][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:32:23,893][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:32:24,215][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:32:24,538][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:32:24,862][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:32:25,198][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:32:25,526][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:32:25,851][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:32:26,174][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:32:26,506][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:32:26,836][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:32:27,159][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:32:27,483][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:32:27,808][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:32:28,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:32:28,850][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:32:29,529][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:32:29,530][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:32:29,532][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:32:30,326][__main__][INFO] - Iteration 535 took 23s (41.26% Gen, 55.35% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 9m 17s. Estimated total time: 19h 34m 58s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 9s, 500 more iterations: 3h 15m 49s. [2025-11-13 11:32:30,328][__main__][INFO] - Starting iteration 535. [2025-11-13 11:32:30,331][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 11:32:30,331][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:32:39,988][__main__][INFO] - Number of regex retries in iteration 535: 0 [2025-11-13 11:32:39,989][__main__][INFO] - agents played in iteration 535 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:32:40,421][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:40,455][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:40,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:40,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:40,524][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:32:40,524][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:32:41,235][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:32:41,530][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:32:41,853][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:32:42,178][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:32:42,506][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:32:42,828][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:32:43,152][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:32:43,474][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:32:43,798][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:32:44,126][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:32:44,449][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:32:44,772][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:32:45,095][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:32:45,419][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:32:45,742][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:32:46,065][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:32:46,388][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:32:46,712][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:32:47,034][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:32:47,358][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:32:47,681][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:32:48,004][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:32:48,333][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:32:48,658][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:32:48,988][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:32:49,312][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:32:49,634][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:32:49,958][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:32:50,291][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:32:50,620][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:32:50,950][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:32:51,273][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:32:51,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:32:52,309][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:32:53,002][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:32:53,004][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:32:53,005][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:32:53,850][__main__][INFO] - Iteration 536 took 23s (41.06% Gen, 55.34% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 9m 56s. Estimated total time: 19h 36m 0s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 12s, 500 more iterations: 3h 16m 0s. [2025-11-13 11:32:53,895][__main__][INFO] - Starting iteration 536. [2025-11-13 11:32:53,912][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 11:32:53,913][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:33:03,083][__main__][INFO] - Number of regex retries in iteration 536: 0 [2025-11-13 11:33:03,084][__main__][INFO] - agents played in iteration 536 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:33:03,509][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:03,544][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:03,578][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:03,612][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:03,612][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:33:03,612][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:33:04,314][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:33:04,608][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:33:04,933][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:33:05,255][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:33:05,579][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:33:05,910][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:33:06,233][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:33:06,557][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:33:06,880][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:33:07,209][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:33:07,533][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:33:07,856][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:33:08,179][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:33:08,510][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:33:08,834][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:33:09,158][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:33:09,486][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:33:09,809][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:33:10,131][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:33:10,453][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:33:10,775][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:33:11,102][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:33:11,424][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:33:11,746][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:33:12,070][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:33:12,398][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:33:12,728][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:33:13,051][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:33:13,374][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:33:13,704][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:33:14,030][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:33:14,355][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:33:14,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:33:15,401][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:33:16,079][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:33:16,080][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:33:16,082][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:33:16,863][__main__][INFO] - Iteration 537 took 22s (39.93% Gen, 56.60% Train). Generation: 9s, Training: 12s. Estimated remaining time: 18h 41m 51s. Estimated total time: 19h 8m 18s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 16s, 500 more iterations: 3h 11m 23s. [2025-11-13 11:33:16,865][__main__][INFO] - Starting iteration 537. [2025-11-13 11:33:16,868][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 11:33:16,868][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:33:26,311][__main__][INFO] - Number of regex retries in iteration 537: 0 [2025-11-13 11:33:26,312][__main__][INFO] - agents played in iteration 537 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:33:26,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:26,809][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:26,842][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:26,877][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:26,878][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:33:26,878][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:33:27,586][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:33:27,881][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:33:28,207][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:33:28,529][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:33:28,852][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:33:29,175][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:33:29,498][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:33:29,823][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:33:30,145][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:33:30,470][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:33:30,793][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:33:31,118][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:33:31,441][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:33:31,765][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:33:32,087][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:33:32,414][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:33:32,739][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:33:33,064][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:33:33,393][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:33:33,723][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:33:34,045][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128
[2025-11-13 11:33:34,368][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:33:34,691][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:33:35,013][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:33:35,337][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:33:35,661][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:33:35,986][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:33:36,311][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:33:36,634][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:33:36,958][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:33:37,284][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:33:37,608][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:33:37,934][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:33:38,646][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:33:39,335][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:33:39,336][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:33:39,337][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:33:40,139][__main__][INFO] - Iteration 538 took 23s (40.58% Gen, 55.98% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 56m 44s. Estimated total time: 19h 23m 34s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 47s, 500 more iterations: 3h 13m 55s.
[2025-11-13 11:33:40,141][__main__][INFO] - Starting iteration 538.
[2025-11-13 11:33:40,143][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1.
[2025-11-13 11:33:40,143][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:33:49,602][__main__][INFO] - Number of regex retries in iteration 538: 0
[2025-11-13 11:33:49,602][__main__][INFO] - agents played in iteration 538 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:33:50,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:33:50,079][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:33:50,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:33:50,146][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:33:50,147][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:33:50,147][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:33:50,887][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:33:51,183][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:33:51,508][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:33:51,831][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:33:52,155][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:33:52,479][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:33:52,804][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:33:53,127][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:33:53,450][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:33:53,775][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:33:54,099][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:33:54,424][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:33:54,749][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:33:55,071][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:33:55,392][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:33:55,715][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:33:56,038][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:33:56,361][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:33:56,687][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:33:57,011][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:33:57,334][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:33:57,657][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:33:57,982][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:33:58,310][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:33:58,635][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:33:58,964][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:33:59,287][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:33:59,611][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:33:59,935][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:34:00,258][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:34:00,581][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:34:00,904][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:34:01,230][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:34:01,940][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:34:02,683][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:34:02,684][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:34:02,686][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:34:03,502][__main__][INFO] - Iteration 539 took 23s (40.49% Gen, 56.01% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 0m 46s. Estimated total time: 19h 28m 0s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 56s, 500 more iterations: 3h 14m 40s.
[2025-11-13 11:34:03,504][__main__][INFO] - Starting iteration 539.
[2025-11-13 11:34:03,507][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1.
[2025-11-13 11:34:03,508][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:34:13,868][__main__][INFO] - Number of regex retries in iteration 539: 0
[2025-11-13 11:34:13,869][__main__][INFO] - agents played in iteration 539 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:34:14,326][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:34:14,361][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:34:14,395][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:34:14,430][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:34:14,431][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:34:14,431][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:34:15,147][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:34:15,442][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:34:15,766][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:34:16,090][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:34:16,413][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:34:16,736][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:34:17,060][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:34:17,382][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:34:17,705][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:34:18,030][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:34:18,352][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:34:18,675][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:34:19,000][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:34:19,322][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:34:19,646][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:34:19,971][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:34:20,293][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:34:20,615][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:34:20,939][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:34:21,262][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:34:21,587][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:34:21,909][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:34:22,232][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:34:22,555][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:34:22,877][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:34:23,204][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:34:23,527][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:34:23,849][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:34:24,173][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:34:24,500][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:34:24,824][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:34:25,148][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:34:25,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:34:26,194][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:34:26,903][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:34:26,905][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:34:26,907][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:34:27,762][__main__][INFO] - Iteration 540 took 24s (42.72% Gen, 53.75% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 45m 8s. Estimated total time: 20h 12m 46s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 25s, 500 more iterations: 3h 22m 7s.
[2025-11-13 11:34:27,764][__main__][INFO] - Starting iteration 540.
[2025-11-13 11:34:27,766][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1.
[2025-11-13 11:34:27,767][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:34:37,662][__main__][INFO] - Number of regex retries in iteration 540: 0
[2025-11-13 11:34:37,663][__main__][INFO] - agents played in iteration 540 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:34:38,108][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:34:38,142][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:34:38,176][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:34:38,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:34:38,211][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:34:38,211][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:34:38,942][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:34:39,237][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:34:39,561][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:34:39,884][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:34:40,209][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:34:40,532][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:34:40,854][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:34:41,178][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:34:41,506][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:34:41,828][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:34:42,151][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:34:42,475][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:34:42,804][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:34:43,129][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:34:43,455][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:34:43,780][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:34:44,106][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:34:44,433][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:34:44,756][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:34:45,080][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:34:45,402][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:34:45,731][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:34:46,060][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:34:46,382][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:34:46,708][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:34:47,036][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:34:47,358][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:34:47,680][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:34:48,004][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:34:48,331][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:34:48,654][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:34:48,979][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:34:49,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:34:50,013][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:34:50,754][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:34:50,755][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:34:50,759][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:34:52,609][__main__][INFO] - Iteration 541 took 24s (39.83% Gen, 52.72% Train). Generation: 9s, Training: 13s. Estimated remaining time: 20h 14m 9s. Estimated total time: 20h 42m 11s. Time estimates for 10 more iterations: 4m 8s, 100 more iterations: 41m 24s, 500 more iterations: 3h 27m 1s.
[2025-11-13 11:34:52,611][__main__][INFO] - Starting iteration 541.
[2025-11-13 11:34:52,614][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:34:52,614][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:35:02,369][__main__][INFO] - Number of regex retries in iteration 541: 0
[2025-11-13 11:35:02,369][__main__][INFO] - agents played in iteration 541 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:35:02,822][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:02,856][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:02,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:02,922][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:02,923][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:35:02,924][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:35:03,610][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:35:03,904][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:35:04,232][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:35:04,555][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:35:04,879][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:35:05,202][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:35:05,531][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:35:05,854][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:35:06,180][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:35:06,508][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:35:06,831][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:35:07,154][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:35:07,477][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:35:07,801][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:35:08,131][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:35:08,453][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:35:08,774][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:35:09,099][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:35:09,429][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:35:09,752][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:35:10,074][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:35:10,397][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:35:10,723][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:35:11,045][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:35:11,369][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:35:11,692][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:35:12,021][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:35:12,345][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:35:12,669][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:35:12,998][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:35:13,323][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:35:13,648][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:35:13,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:35:14,709][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:35:15,406][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:35:15,407][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:35:15,411][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:35:16,413][__main__][INFO] - Iteration 542 took 23s (40.99% Gen, 54.79% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 21m 34s. Estimated total time: 19h 50m 1s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 40s, 500 more iterations: 3h 18m 20s.
[2025-11-13 11:35:16,415][__main__][INFO] - Starting iteration 542.
[2025-11-13 11:35:16,418][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:35:16,418][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:35:26,253][__main__][INFO] - Number of regex retries in iteration 542: 0
[2025-11-13 11:35:26,254][__main__][INFO] - agents played in iteration 542 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:35:26,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:26,735][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:26,769][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:26,804][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:26,804][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:35:26,805][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:35:27,514][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:35:27,810][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:35:28,134][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:35:28,456][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:35:28,782][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:35:29,107][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:35:29,431][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:35:29,755][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:35:30,081][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:35:30,403][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:35:30,726][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:35:31,048][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:35:31,373][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:35:31,696][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:35:32,019][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:35:32,341][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:35:32,665][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:35:32,988][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:35:33,318][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:35:33,640][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:35:33,962][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:35:34,288][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:35:34,617][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:35:34,941][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:35:35,264][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:35:35,586][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:35:35,916][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:35:36,239][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:35:36,565][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:35:36,888][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:35:37,220][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:35:37,544][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:35:37,868][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:35:38,581][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:35:39,326][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:35:39,328][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:35:39,329][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:35:40,150][__main__][INFO] - Iteration 543 took 23s (41.44% Gen, 55.10% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 17m 47s. Estimated total time: 19h 46m 38s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 33s, 500 more iterations: 3h 17m 46s.
[2025-11-13 11:35:40,151][__main__][INFO] - Starting iteration 543.
[2025-11-13 11:35:40,154][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:35:40,155][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:35:51,063][__main__][INFO] - Number of regex retries in iteration 543: 0
[2025-11-13 11:35:51,063][__main__][INFO] - agents played in iteration 543 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:35:51,517][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:51,554][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:51,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:51,623][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:51,624][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:35:51,624][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:35:52,343][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:35:52,644][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:35:52,961][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:35:53,286][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:35:53,611][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:35:53,945][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:35:54,272][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:35:54,603][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:35:54,931][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:35:55,259][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:35:55,575][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:35:55,897][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:35:56,222][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:35:56,543][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:35:56,865][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:35:57,190][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:35:57,514][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:35:57,836][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:35:58,158][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:35:58,482][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:35:58,805][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:35:59,127][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:35:59,451][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:35:59,774][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:36:00,097][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:36:00,425][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:36:00,750][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:36:01,073][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:36:01,397][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:36:01,722][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:36:02,047][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:36:02,370][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:36:02,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:36:03,401][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:36:04,117][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:36:04,118][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:36:04,120][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:36:04,973][__main__][INFO] - Iteration 544 took 24s (43.95% Gen, 52.61% Train). Generation: 10s, Training: 13s. Estimated remaining time: 20h 11m 42s. Estimated total time: 20h 40m 58s. Time estimates for 10 more iterations: 4m 8s, 100 more iterations: 41m 21s, 500 more iterations: 3h 26m 49s.
[2025-11-13 11:36:04,975][__main__][INFO] - Starting iteration 544.
[2025-11-13 11:36:04,978][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:36:04,979][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:36:15,938][__main__][INFO] - Number of regex retries in iteration 544: 0
[2025-11-13 11:36:15,939][__main__][INFO] - agents played in iteration 544 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:36:16,381][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:16,415][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:16,449][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:16,483][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:16,483][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:36:16,484][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:36:17,207][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:36:17,503][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:36:17,828][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:36:18,151][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:36:18,474][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:36:18,797][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:36:19,122][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:36:19,446][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:36:19,768][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:36:20,095][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:36:20,418][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:36:20,742][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:36:21,065][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:36:21,390][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:36:21,713][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:36:22,042][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:36:22,364][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:36:22,689][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:36:23,011][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:36:23,334][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:36:23,656][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:36:23,984][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:36:24,307][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:36:24,630][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:36:24,952][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:36:25,276][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:36:25,599][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:36:25,922][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:36:26,245][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:36:26,568][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:36:26,893][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:36:27,216][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:36:27,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:36:28,255][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:36:28,977][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:36:28,978][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:36:28,980][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:36:30,140][__main__][INFO] - Iteration 545 took 25s (43.56% Gen, 51.83% Train). Generation: 10s, Training: 13s. Estimated remaining time: 20h 28m 26s. Estimated total time: 20h 58m 6s. Time estimates for 10 more iterations: 4m 11s, 100 more iterations: 41m 56s, 500 more iterations: 3h 29m 41s.
[2025-11-13 11:36:30,142][__main__][INFO] - Starting iteration 545.
[2025-11-13 11:36:30,145][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:36:30,145][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:36:40,590][__main__][INFO] - Number of regex retries in iteration 545: 0
[2025-11-13 11:36:40,591][__main__][INFO] - agents played in iteration 545 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:36:41,026][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:41,062][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:41,095][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:41,129][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:41,130][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:36:41,130][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:36:41,840][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:36:42,136][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:36:42,459][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:36:42,782][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:36:43,104][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:36:43,428][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:36:43,750][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:36:44,074][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:36:44,397][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:36:44,722][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:36:45,045][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:36:45,367][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:36:45,690][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:36:46,012][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:36:46,338][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:36:46,662][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:36:46,989][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:36:47,312][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:36:47,636][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:36:47,962][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:36:48,286][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:36:48,610][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:36:48,934][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:36:49,256][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:36:49,580][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:36:49,903][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:36:50,226][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:36:50,553][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:36:50,876][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:36:51,199][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:36:51,524][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:36:51,849][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:36:52,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:36:52,886][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:36:53,607][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:36:53,609][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:36:53,611][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:36:54,437][__main__][INFO] - Iteration 546 took 24s (43.00% Gen, 53.59% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 44m 34s. Estimated total time: 20h 14m 39s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 29s, 500 more iterations: 3h 22m 26s.
[2025-11-13 11:36:54,439][__main__][INFO] - Starting iteration 546.
[2025-11-13 11:36:54,441][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:36:54,442][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:37:04,896][__main__][INFO] - Number of regex retries in iteration 546: 0
[2025-11-13 11:37:04,896][__main__][INFO] - agents played in iteration 546 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:37:05,339][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:05,374][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:05,408][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:05,443][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:05,443][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:37:05,443][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:37:06,153][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:37:06,449][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:37:06,774][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:37:07,101][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:37:07,424][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:37:07,746][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:37:08,071][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:37:08,394][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:37:08,717][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:37:09,038][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:37:09,366][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:37:09,688][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:37:10,010][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:37:10,333][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:37:10,658][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:37:10,982][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:37:11,305][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:37:11,627][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:37:11,952][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:37:12,275][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:37:12,599][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:37:12,921][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:37:13,255][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:37:13,578][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:37:13,900][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:37:14,223][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:37:14,561][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:37:14,883][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:37:15,206][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:37:15,529][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:37:15,861][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:37:16,184][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:37:16,507][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:37:17,210][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:37:17,915][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:37:17,916][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:37:17,918][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:37:18,967][__main__][INFO] - Iteration 547 took 24s (42.62% Gen, 53.09% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 55m 51s. Estimated total time: 20h 26m 20s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 52s, 500 more iterations: 3h 24m 23s.
[2025-11-13 11:37:18,969][__main__][INFO] - Starting iteration 547.
[2025-11-13 11:37:18,972][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:37:18,972][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:37:29,153][__main__][INFO] - Number of regex retries in iteration 547: 0
[2025-11-13 11:37:29,154][__main__][INFO] - agents played in iteration 547 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:37:29,588][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:29,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:29,658][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:29,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:29,692][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:37:29,693][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:37:30,415][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:37:30,711][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:37:31,040][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:37:31,364][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:37:31,688][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:37:32,013][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:37:32,335][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:37:32,658][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:37:32,980][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:37:33,302][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:37:33,624][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:37:33,946][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:37:34,270][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:37:34,593][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:37:34,917][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:37:35,246][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:37:35,568][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:37:35,890][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:37:36,213][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:37:36,545][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:37:36,867][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:37:37,192][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:37:37,514][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:37:37,845][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:37:38,170][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:37:38,494][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:37:38,815][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:37:39,143][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:37:39,466][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:37:39,790][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:37:40,113][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:37:40,437][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:37:40,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:37:41,449][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:37:42,165][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:37:42,167][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:37:42,168][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:37:43,185][__main__][INFO] - Iteration 548 took 24s (42.04% Gen, 53.75% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 39m 50s. Estimated total time: 20h 10m 43s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 21s, 500 more iterations: 3h 21m 47s.
[2025-11-13 11:37:43,187][__main__][INFO] - Starting iteration 548.
[2025-11-13 11:37:43,190][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:37:43,191][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:37:53,669][__main__][INFO] - Number of regex retries in iteration 548: 0
[2025-11-13 11:37:53,670][__main__][INFO] - agents played in iteration 548 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:37:54,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:54,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:54,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:54,208][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:54,208][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:37:54,208][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:37:54,922][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:37:55,217][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:37:55,542][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:37:55,868][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:37:56,193][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:37:56,523][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:37:56,847][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:37:57,169][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:37:57,494][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:37:57,819][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:37:58,143][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:37:58,467][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:37:58,792][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:37:59,112][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:37:59,434][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:37:59,760][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:38:00,082][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:38:00,408][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:38:00,733][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:38:01,056][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:38:01,387][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:38:01,700][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:38:02,023][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:38:02,346][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:38:02,677][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:38:02,993][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:38:03,315][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:38:03,638][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:38:03,964][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:38:04,283][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:38:04,615][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:38:04,938][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:38:05,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:38:05,970][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:38:06,705][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:38:06,706][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:38:06,708][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:38:07,570][__main__][INFO] - Iteration 549 took 24s (42.98% Gen, 53.48% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 47m 42s. Estimated total time: 20h 19m 0s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 38s, 500 more iterations: 3h 23m 10s.
[2025-11-13 11:38:07,571][__main__][INFO] - Starting iteration 549.
[2025-11-13 11:38:07,574][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:38:07,575][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:38:18,273][__main__][INFO] - Number of regex retries in iteration 549: 0
[2025-11-13 11:38:18,273][__main__][INFO] - agents played in iteration 549 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:38:18,707][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:38:18,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:38:18,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:38:18,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:38:18,808][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:38:18,809][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:38:19,519][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:38:19,814][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:38:20,139][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:38:20,464][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:38:20,790][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:38:21,112][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:38:21,436][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:38:21,763][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:38:22,087][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:38:22,412][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:38:22,737][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:38:23,058][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:38:23,381][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:38:23,706][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:38:24,030][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:38:24,354][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:38:24,677][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:38:24,999][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:38:25,322][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:38:25,645][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:38:25,972][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:38:26,300][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:38:26,625][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:38:26,949][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:38:27,274][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:38:27,602][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:38:27,925][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:38:28,247][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:38:28,570][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:38:28,893][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:38:29,219][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:38:29,542][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:38:29,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:38:30,567][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:38:31,253][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:38:31,255][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:38:31,256][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:38:32,097][__main__][INFO] - Iteration 550 took 24s (43.63% Gen, 52.94% Train). Generation: 10s, Training: 12s. Estimated remaining time: 19h 54m 27s. Estimated total time: 20h 26m 10s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 52s, 500 more iterations: 3h 24m 21s. [2025-11-13 11:38:32,099][__main__][INFO] - Starting iteration 550. [2025-11-13 11:38:32,101][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1. 
[2025-11-13 11:38:32,102][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:38:42,610][__main__][INFO] - Number of regex retries in iteration 550: 0 [2025-11-13 11:38:42,611][__main__][INFO] - agents played in iteration 550 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:38:43,042][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:38:43,076][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:38:43,110][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:38:43,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:38:43,144][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:38:43,145][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:38:43,857][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:38:44,153][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:38:44,477][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:38:44,801][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:38:45,124][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:38:45,446][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:38:45,769][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:38:46,091][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:38:46,416][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:38:46,738][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:38:47,060][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:38:47,385][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:38:47,708][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:38:48,031][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:38:48,352][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:38:48,681][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:38:49,004][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:38:49,328][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:38:49,654][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:38:49,983][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:38:50,305][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:38:50,627][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:38:50,948][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:38:51,271][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:38:51,594][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:38:51,918][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:38:52,241][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:38:52,577][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:38:52,900][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:38:53,223][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:38:53,547][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:38:53,872][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:38:54,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:38:54,890][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:38:55,610][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:38:55,611][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:38:55,613][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:38:57,157][__main__][INFO] - Iteration 551 took 25s (41.94% Gen, 51.89% Train). Generation: 10s, Training: 13s. Estimated remaining time: 20h 20m 41s. Estimated total time: 20h 52m 49s. Time estimates for 10 more iterations: 4m 10s, 100 more iterations: 41m 45s, 500 more iterations: 3h 28m 48s. [2025-11-13 11:38:57,159][__main__][INFO] - Starting iteration 551. [2025-11-13 11:38:57,163][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. 
[2025-11-13 11:38:57,164][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:39:07,605][__main__][INFO] - Number of regex retries in iteration 551: 0 [2025-11-13 11:39:07,605][__main__][INFO] - agents played in iteration 551 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:39:08,052][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:08,102][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:08,135][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:08,168][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:08,169][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:39:08,169][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:39:08,886][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:39:09,182][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:39:09,511][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:39:09,831][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:39:10,154][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:39:10,479][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:39:10,808][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:39:11,126][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:39:11,449][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:39:11,774][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:39:12,107][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:39:12,425][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:39:12,750][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:39:13,074][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:39:13,404][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:39:13,723][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:39:14,051][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:39:14,374][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:39:14,705][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:39:15,024][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:39:15,349][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:39:15,673][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:39:16,007][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:39:16,324][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:39:16,647][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:39:16,971][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:39:17,299][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:39:17,622][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:39:17,947][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:39:18,272][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:39:18,597][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:39:18,921][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:39:19,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:39:19,970][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:39:20,666][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:39:20,667][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:39:20,669][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:39:21,528][__main__][INFO] - Iteration 552 took 24s (42.85% Gen, 53.62% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 45m 44s. Estimated total time: 20h 18m 16s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 36s, 500 more iterations: 3h 23m 2s. [2025-11-13 11:39:21,530][__main__][INFO] - Starting iteration 552. [2025-11-13 11:39:21,533][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. 
[2025-11-13 11:39:21,533][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:39:32,276][__main__][INFO] - Number of regex retries in iteration 552: 0 [2025-11-13 11:39:32,277][__main__][INFO] - agents played in iteration 552 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:39:32,714][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:32,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:32,783][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:32,817][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:32,817][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:39:32,817][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:39:33,542][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:39:33,837][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:39:34,165][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:39:34,489][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:39:34,814][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:39:35,136][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:39:35,458][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:39:35,788][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:39:36,103][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:39:36,426][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:39:36,752][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:39:37,075][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:39:37,398][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:39:37,721][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:39:38,045][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:39:38,369][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:39:38,692][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:39:39,015][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:39:39,339][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:39:39,661][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:39:39,984][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:39:40,306][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:39:40,629][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:39:40,952][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:39:41,285][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:39:41,607][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:39:41,929][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:39:42,252][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:39:42,575][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:39:42,898][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:39:43,221][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:39:43,545][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:39:43,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:39:44,582][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:39:45,305][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:39:45,306][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:39:45,308][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:39:46,196][__main__][INFO] - Iteration 553 took 24s (43.56% Gen, 52.84% Train). Generation: 10s, Training: 13s. Estimated remaining time: 20h 0m 17s. Estimated total time: 20h 33m 13s. Time estimates for 10 more iterations: 4m 6s, 100 more iterations: 41m 6s, 500 more iterations: 3h 25m 32s. [2025-11-13 11:39:46,199][__main__][INFO] - Starting iteration 553. [2025-11-13 11:39:46,202][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. 
[2025-11-13 11:39:46,203][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:39:56,654][__main__][INFO] - Number of regex retries in iteration 553: 0 [2025-11-13 11:39:56,655][__main__][INFO] - agents played in iteration 553 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:39:57,108][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:57,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:57,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:57,212][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:57,212][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:39:57,213][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:39:57,913][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:39:58,207][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:39:58,531][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:39:58,855][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:39:59,180][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:39:59,504][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:39:59,830][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:40:00,153][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:40:00,476][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:40:00,806][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:40:01,121][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:40:01,444][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:40:01,767][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:40:02,097][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:40:02,413][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:40:02,735][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:40:03,059][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:40:03,387][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:40:03,704][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:40:04,027][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:40:04,350][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:40:04,677][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:40:04,997][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:40:05,320][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:40:05,643][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:40:05,966][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:40:06,289][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:40:06,613][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:40:06,935][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:40:07,258][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:40:07,583][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:40:07,906][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:40:08,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:40:08,948][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:40:09,661][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:40:09,663][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:40:09,665][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:40:10,511][__main__][INFO] - Iteration 554 took 24s (43.00% Gen, 53.51% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 42m 9s. Estimated total time: 20h 15m 29s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 30s, 500 more iterations: 3h 22m 34s. [2025-11-13 11:40:10,513][__main__][INFO] - Starting iteration 554. [2025-11-13 11:40:10,516][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. 
[2025-11-13 11:40:10,516][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:40:21,629][__main__][INFO] - Number of regex retries in iteration 554: 0 [2025-11-13 11:40:21,630][__main__][INFO] - agents played in iteration 554 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:40:22,066][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:40:22,102][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:40:22,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:40:22,170][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:40:22,171][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:40:22,171][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:40:22,887][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:40:23,182][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:40:23,507][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:40:23,831][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:40:24,155][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:40:24,481][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:40:24,808][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:40:25,130][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:40:25,455][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:40:25,780][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:40:26,103][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:40:26,425][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:40:26,747][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:40:27,069][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:40:27,391][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:40:27,713][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:40:28,036][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:40:28,359][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:40:28,682][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:40:29,004][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:40:29,327][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:40:29,651][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:40:29,978][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:40:30,301][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:40:30,624][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:40:30,947][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:40:31,276][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:40:31,599][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:40:31,922][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:40:32,246][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:40:32,573][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:40:32,899][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:40:33,224][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:40:33,946][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:40:34,655][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:40:34,657][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:40:34,659][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:40:35,541][__main__][INFO] - Iteration 555 took 25s (44.41% Gen, 52.06% Train). Generation: 11s, Training: 13s. Estimated remaining time: 20h 17m 33s. Estimated total time: 20h 51m 19s. Time estimates for 10 more iterations: 4m 10s, 100 more iterations: 41m 42s, 500 more iterations: 3h 28m 33s.
[2025-11-13 11:40:35,544][__main__][INFO] - Starting iteration 555.
[2025-11-13 11:40:35,546][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:40:35,547][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:40:46,482][__main__][INFO] - Number of regex retries in iteration 555: 0
[2025-11-13 11:40:46,483][__main__][INFO] - agents played in iteration 555 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:40:46,914][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:40:46,947][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:40:46,980][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:40:47,013][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:40:47,014][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:40:47,014][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:40:47,712][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:40:48,006][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:40:48,331][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:40:48,654][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:40:48,977][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:40:49,300][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:40:49,623][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:40:49,945][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:40:50,267][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:40:50,589][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:40:50,911][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:40:51,234][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:40:51,557][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:40:51,884][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:40:52,207][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:40:52,530][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:40:52,851][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:40:53,178][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:40:53,500][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:40:53,822][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:40:54,145][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:40:54,468][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:40:54,791][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:40:55,112][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:40:55,437][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:40:55,761][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:40:56,089][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:40:56,418][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:40:56,744][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:40:57,067][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:40:57,390][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:40:57,713][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:40:58,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:40:58,764][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:40:59,450][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:40:59,451][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:40:59,452][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:41:00,242][__main__][INFO] - Iteration 556 took 24s (44.28% Gen, 52.52% Train). Generation: 10s, Training: 12s. Estimated remaining time: 20h 0m 39s. Estimated total time: 20h 34m 50s. Time estimates for 10 more iterations: 4m 6s, 100 more iterations: 41m 9s, 500 more iterations: 3h 25m 48s.
[2025-11-13 11:41:00,244][__main__][INFO] - Starting iteration 556.
[2025-11-13 11:41:00,247][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:41:00,247][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:41:11,257][__main__][INFO] - Number of regex retries in iteration 556: 0
[2025-11-13 11:41:11,257][__main__][INFO] - agents played in iteration 556 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:41:11,709][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:11,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:11,785][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:11,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:11,819][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:41:11,819][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:41:12,510][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:41:12,804][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:41:13,128][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:41:13,450][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:41:13,772][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:41:14,095][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:41:14,417][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:41:14,740][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:41:15,063][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:41:15,385][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:41:15,708][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:41:16,031][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:41:16,354][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:41:16,679][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:41:17,003][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:41:17,326][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:41:17,649][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:41:17,972][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:41:18,297][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:41:18,620][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:41:18,943][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:41:19,267][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:41:19,590][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:41:19,912][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:41:20,235][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:41:20,558][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:41:20,882][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:41:21,205][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:41:21,529][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:41:21,851][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:41:22,176][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:41:22,501][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:41:22,825][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:41:23,546][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:41:24,250][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:41:24,251][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:41:24,253][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:41:25,070][__main__][INFO] - Iteration 557 took 24s (44.35% Gen, 52.35% Train). Generation: 11s, Training: 12s. Estimated remaining time: 20h 6m 37s. Estimated total time: 20h 41m 12s. Time estimates for 10 more iterations: 4m 8s, 100 more iterations: 41m 22s, 500 more iterations: 3h 26m 52s.
[2025-11-13 11:41:25,072][__main__][INFO] - Starting iteration 557.
[2025-11-13 11:41:25,075][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:41:25,075][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:41:35,561][__main__][INFO] - Number of regex retries in iteration 557: 0
[2025-11-13 11:41:35,562][__main__][INFO] - agents played in iteration 557 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:41:36,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:36,055][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:36,089][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:36,123][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:36,124][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:41:36,124][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:41:36,837][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:41:37,132][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:41:37,456][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:41:37,778][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:41:38,102][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:41:38,425][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:41:38,748][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:41:39,071][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:41:39,393][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:41:39,715][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:41:40,038][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:41:40,361][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:41:40,684][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:41:41,007][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:41:41,330][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:41:41,653][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:41:41,975][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:41:42,300][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:41:42,623][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:41:42,948][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:41:43,271][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:41:43,593][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:41:43,916][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:41:44,237][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:41:44,564][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:41:44,893][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:41:45,216][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:41:45,540][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:41:45,869][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:41:46,195][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:41:46,522][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:41:46,848][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:41:47,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:41:47,905][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:41:48,595][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:41:48,597][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:41:48,598][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:41:49,397][__main__][INFO] - Iteration 558 took 24s (43.11% Gen, 53.60% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 41m 10s. Estimated total time: 20h 16m 9s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 32s, 500 more iterations: 3h 22m 41s.
[2025-11-13 11:41:49,399][__main__][INFO] - Starting iteration 558.
[2025-11-13 11:41:49,402][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:41:49,402][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:42:00,239][__main__][INFO] - Number of regex retries in iteration 558: 0
[2025-11-13 11:42:00,239][__main__][INFO] - agents played in iteration 558 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:42:00,678][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:42:00,726][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:42:00,759][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:42:00,793][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:42:00,793][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:42:00,794][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:42:01,492][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:42:01,785][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:42:02,109][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:42:02,436][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:42:02,765][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:42:03,093][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:42:03,421][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:42:03,747][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:42:04,072][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:42:04,401][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:42:04,726][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:42:05,051][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:42:05,375][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:42:05,701][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:42:06,025][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:42:06,353][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:42:06,678][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:42:07,007][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:42:07,331][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:42:07,658][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:42:07,983][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:42:08,305][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:42:08,631][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:42:08,956][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:42:09,281][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:42:09,605][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:42:09,934][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:42:10,261][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:42:10,591][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:42:10,915][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:42:11,241][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:42:11,565][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:42:11,889][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:42:12,623][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:42:13,304][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:42:13,305][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:42:13,307][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:42:14,129][__main__][INFO] - Iteration 559 took 24s (43.82% Gen, 52.84% Train). Generation: 10s, Training: 13s. Estimated remaining time: 20h 1m 0s. Estimated total time: 20h 36m 24s. Time estimates for 10 more iterations: 4m 7s, 100 more iterations: 41m 12s, 500 more iterations: 3h 26m 4s.
[2025-11-13 11:42:14,131][__main__][INFO] - Starting iteration 559.
[2025-11-13 11:42:14,134][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:42:14,134][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:42:23,911][__main__][INFO] - Number of regex retries in iteration 559: 0
[2025-11-13 11:42:23,912][__main__][INFO] - agents played in iteration 559 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:42:24,358][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:42:24,394][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:42:24,427][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:42:24,461][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:42:24,462][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:42:24,463][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:42:25,164][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:42:25,459][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:42:25,789][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:42:26,116][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:42:26,441][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:42:26,765][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:42:27,091][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:42:27,415][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:42:27,737][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:42:28,061][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:42:28,388][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:42:28,711][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:42:29,035][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:42:29,364][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:42:29,689][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:42:30,014][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:42:30,340][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:42:30,669][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:42:30,994][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:42:31,319][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:42:31,645][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:42:31,968][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:42:32,291][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:42:32,619][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:42:32,942][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:42:33,266][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:42:33,591][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:42:33,919][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:42:34,243][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:42:34,572][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:42:34,898][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:42:35,223][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:42:35,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:42:36,276][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:42:36,951][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:42:36,952][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:42:36,954][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:42:38,004][__main__][INFO] - Iteration 560 took 23s (40.96% Gen, 54.64% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 17m 43s. Estimated total time: 19h 53m 31s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 47s, 500 more iterations: 3h 18m 55s.
[2025-11-13 11:42:38,006][__main__][INFO] - Starting iteration 560.
[2025-11-13 11:42:38,008][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:42:38,009][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:42:47,702][__main__][INFO] - Number of regex retries in iteration 560: 0
[2025-11-13 11:42:47,702][__main__][INFO] - agents played in iteration 560 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:42:48,124][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:42:48,158][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:42:48,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:42:48,225][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:42:48,226][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:42:48,226][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:42:48,927][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:42:49,221][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:42:49,545][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:42:49,871][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:42:50,192][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:42:50,514][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:42:50,837][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:42:51,162][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:42:51,485][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:42:51,808][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:42:52,134][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:42:52,457][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:42:52,780][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:42:53,104][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:42:53,429][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:42:53,752][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:42:54,077][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:42:54,399][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:42:54,723][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:42:55,047][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:42:55,370][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:42:55,699][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:42:56,024][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:42:56,348][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:42:56,672][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:42:56,996][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:42:57,320][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:42:57,643][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:42:57,968][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:42:58,294][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:42:58,618][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:42:58,943][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:42:59,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:42:59,981][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:43:00,660][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:43:00,662][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:43:00,664][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:43:02,210][__main__][INFO] - Iteration 561 took 24s (40.05% Gen, 53.55% Train). Generation: 9s, Training: 12s. Estimated remaining time: 19h 33m 53s. Estimated total time: 20h 10m 6s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 20s, 500 more iterations: 3h 21m 41s.
[2025-11-13 11:43:02,211][__main__][INFO] - Starting iteration 561.
[2025-11-13 11:43:02,214][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:43:02,215][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:43:11,754][__main__][INFO] - Number of regex retries in iteration 561: 0
[2025-11-13 11:43:11,755][__main__][INFO] - agents played in iteration 561 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:43:12,176][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:43:12,212][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:43:12,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:43:12,278][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:43:12,278][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:43:12,279][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:43:13,003][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:43:13,301][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:43:13,625][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:43:13,948][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:43:14,272][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:43:14,597][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:43:14,920][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:43:15,244][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:43:15,568][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:43:15,893][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:43:16,216][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:43:16,539][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:43:16,862][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:43:17,185][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:43:17,508][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:43:17,831][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:43:18,154][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:43:18,477][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:43:18,800][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:43:19,124][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:43:19,448][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:43:19,772][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:43:20,096][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:43:20,419][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:43:20,743][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:43:21,066][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:43:21,390][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:43:21,713][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:43:22,040][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:43:22,363][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:43:22,687][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:43:23,011][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:43:23,343][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:43:24,059][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:43:24,747][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:43:24,749][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:43:24,750][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:43:25,575][__main__][INFO] - Iteration 562 took 23s (40.83% Gen, 55.63% Train). Generation: 9s, Training: 12s. Estimated remaining time: 18h 51m 28s. Estimated total time: 19h 28m 4s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 56s, 500 more iterations: 3h 14m 40s.
[2025-11-13 11:43:25,577][__main__][INFO] - Starting iteration 562.
[2025-11-13 11:43:25,579][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:43:25,580][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:43:35,744][__main__][INFO] - Number of regex retries in iteration 562: 0
[2025-11-13 11:43:35,745][__main__][INFO] - agents played in iteration 562 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:43:36,167][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:43:36,201][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:43:36,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:43:36,266][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:43:36,267][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:43:36,267][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:43:36,958][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:43:37,252][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:43:37,577][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:43:37,905][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:43:38,232][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:43:38,556][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:43:38,881][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:43:39,204][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:43:39,528][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:43:39,854][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:43:40,176][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:43:40,501][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:43:40,826][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:43:41,156][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:43:41,481][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:43:41,809][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:43:42,138][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:43:42,462][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:43:42,785][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:43:43,108][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:43:43,432][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:43:43,755][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:43:44,080][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:43:44,404][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:43:44,728][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:43:45,053][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:43:45,380][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:43:45,706][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:43:46,029][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:43:46,354][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:43:46,678][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:43:47,002][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:43:47,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:43:48,032][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:43:48,740][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:43:48,742][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:43:48,743][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:43:49,511][__main__][INFO] - Iteration 563 took 23s (42.47% Gen, 54.31% Train). Generation: 10s, Training: 12s. Estimated remaining time: 19h 19m 36s. Estimated total time: 19h 56m 36s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 53s, 500 more iterations: 3h 19m 26s.
[2025-11-13 11:43:49,513][__main__][INFO] - Starting iteration 563.
[2025-11-13 11:43:49,516][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:43:49,516][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:44:00,011][__main__][INFO] - Number of regex retries in iteration 563: 0
[2025-11-13 11:44:00,011][__main__][INFO] - agents played in iteration 563 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:44:00,442][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:00,482][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:00,517][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:00,566][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:00,566][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:44:00,566][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:44:01,259][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:44:01,554][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:44:01,877][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:44:02,202][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:44:02,532][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:44:02,859][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:44:03,185][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:44:03,507][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:44:03,830][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:44:04,153][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:44:04,476][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:44:04,798][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:44:05,121][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:44:05,445][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:44:05,768][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:44:06,090][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:44:06,413][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:44:06,736][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:44:07,062][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:44:07,382][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:44:07,705][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:44:08,027][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:44:08,361][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:44:08,678][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:44:09,003][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:44:09,325][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:44:09,651][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:44:09,976][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:44:10,305][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:44:10,631][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:44:10,955][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:44:11,279][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:44:11,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:44:12,317][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:44:12,990][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:44:12,992][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:44:12,994][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:44:13,973][__main__][INFO] - Iteration 564 took 24s (42.91% Gen, 53.08% Train). Generation: 10s, Training: 12s. Estimated remaining time: 19h 45m 29s. Estimated total time: 20h 22m 54s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 45s, 500 more iterations: 3h 23m 49s.
[2025-11-13 11:44:13,975][__main__][INFO] - Starting iteration 564.
[2025-11-13 11:44:13,977][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:44:13,978][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:44:23,817][__main__][INFO] - Number of regex retries in iteration 564: 0
[2025-11-13 11:44:23,818][__main__][INFO] - agents played in iteration 564 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:44:24,257][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:24,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:24,338][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:24,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:24,381][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:44:24,381][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:44:25,078][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:44:25,374][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:44:25,702][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:44:26,028][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:44:26,355][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:44:26,683][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:44:27,007][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:44:27,332][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:44:27,657][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:44:27,981][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:44:28,304][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:44:28,636][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:44:28,961][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:44:29,289][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:44:29,612][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:44:29,943][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:44:30,265][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:44:30,588][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:44:30,910][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:44:31,241][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:44:31,564][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:44:31,886][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:44:32,209][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:44:32,534][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:44:32,857][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:44:33,180][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:44:33,503][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:44:33,828][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:44:34,152][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:44:34,475][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:44:34,798][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:44:35,127][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:44:35,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:44:36,160][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:44:36,841][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:44:36,843][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:44:36,844][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:44:37,669][__main__][INFO] - Iteration 565 took 23s (41.53% Gen, 54.98% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 6m 49s. Estimated total time: 19h 44m 37s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 29s, 500 more iterations: 3h 17m 26s.
[2025-11-13 11:44:37,671][__main__][INFO] - Starting iteration 565.
[2025-11-13 11:44:37,673][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:44:37,674][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:44:47,096][__main__][INFO] - Number of regex retries in iteration 565: 0
[2025-11-13 11:44:47,096][__main__][INFO] - agents played in iteration 565 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:44:47,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:47,552][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:47,585][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:47,618][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:47,619][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:44:47,619][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:44:48,324][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:44:48,620][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:44:48,947][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:44:49,276][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:44:49,602][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:44:49,929][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:44:50,253][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:44:50,574][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:44:50,898][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:44:51,223][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:44:51,546][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:44:51,868][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:44:52,193][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:44:52,515][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:44:52,838][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:44:53,162][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:44:53,489][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:44:53,812][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:44:54,134][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:44:54,458][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:44:54,786][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:44:55,108][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:44:55,432][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:44:55,756][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:44:56,079][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:44:56,404][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:44:56,730][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:44:57,053][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:44:57,377][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:44:57,702][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:44:58,025][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:44:58,348][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:44:58,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:44:59,394][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:45:00,076][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:45:00,077][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:45:00,079][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:45:00,981][__main__][INFO] - Iteration 566 took 23s (40.42% Gen, 55.70% Train). Generation: 9s, Training: 12s. Estimated remaining time: 18h 47m 15s. Estimated total time: 19h 25m 26s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 50s, 500 more iterations: 3h 14m 14s.
[2025-11-13 11:45:00,983][__main__][INFO] - Starting iteration 566.
[2025-11-13 11:45:00,986][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:45:00,986][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:45:10,987][__main__][INFO] - Number of regex retries in iteration 566: 0
[2025-11-13 11:45:10,988][__main__][INFO] - agents played in iteration 566 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:45:11,406][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:11,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:11,474][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:11,508][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:11,509][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:45:11,509][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:45:12,215][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:45:12,509][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:45:12,833][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:45:13,162][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:45:13,480][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:45:13,802][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:45:14,124][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:45:14,451][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:45:14,771][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:45:15,093][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:45:15,415][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:45:15,737][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:45:16,063][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:45:16,385][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:45:16,708][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:45:17,030][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:45:17,353][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:45:17,676][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:45:17,998][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:45:18,320][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:45:18,645][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:45:18,968][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:45:19,290][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:45:19,613][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:45:19,940][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:45:20,262][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:45:20,586][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:45:20,910][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:45:21,238][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:45:21,563][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:45:21,886][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:45:22,209][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:45:22,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:45:23,264][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:45:23,955][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:45:23,957][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:45:23,958][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:45:24,800][__main__][INFO] - Iteration 567 took 23s (41.99% Gen, 54.46% Train). Generation: 10s, Training: 12s. Estimated remaining time: 19h 12m 8s. Estimated total time: 19h 50m 43s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 41s, 500 more iterations: 3h 18m 27s.
[2025-11-13 11:45:24,802][__main__][INFO] - Starting iteration 567.
[2025-11-13 11:45:24,805][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:45:24,805][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:45:35,090][__main__][INFO] - Number of regex retries in iteration 567: 0
[2025-11-13 11:45:35,091][__main__][INFO] - agents played in iteration 567 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:45:35,525][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:35,561][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:35,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:35,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:35,628][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:45:35,629][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:45:36,355][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:45:36,941][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:45:37,270][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:45:37,592][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:45:37,914][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:45:38,237][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:45:38,559][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:45:38,881][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:45:39,204][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:45:39,525][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:45:39,859][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:45:40,181][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:45:40,504][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:45:40,828][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:45:41,149][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:45:41,471][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:45:41,793][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:45:42,117][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:45:42,439][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:45:42,762][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:45:43,087][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:45:43,411][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:45:43,741][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:45:44,071][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:45:44,394][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:45:44,716][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:45:45,054][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:45:45,377][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:45:45,701][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:45:46,024][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:45:46,359][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:45:46,683][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:45:47,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:45:47,752][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:45:48,422][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:45:48,424][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:45:48,425][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:45:49,257][__main__][INFO] - Iteration 568 took 24s (42.06% Gen, 54.53% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 43m 42s. Estimated total time: 20h 22m 41s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 45s, 500 more iterations: 3h 23m 46s.
[2025-11-13 11:45:49,259][__main__][INFO] - Starting iteration 568.
[2025-11-13 11:45:49,262][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:45:49,263][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:45:59,169][__main__][INFO] - Number of regex retries in iteration 568: 0
[2025-11-13 11:45:59,170][__main__][INFO] - agents played in iteration 568 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:45:59,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:59,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:59,658][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:59,693][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:59,693][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:45:59,693][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:46:00,404][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:46:00,699][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:46:01,026][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:46:01,353][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:46:01,679][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:46:02,001][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:46:02,323][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:46:02,646][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:46:02,968][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:46:03,293][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:46:03,613][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:46:03,934][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:46:04,257][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:46:04,584][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:46:04,901][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:46:05,224][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:46:05,547][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:46:05,879][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:46:06,194][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:46:06,518][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:46:06,845][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:46:07,176][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:46:07,493][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:46:07,815][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:46:08,137][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:46:08,461][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:46:08,785][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:46:09,110][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:46:09,438][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:46:09,765][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:46:10,088][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:46:10,411][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:46:10,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:46:11,463][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:46:12,133][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:46:12,135][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:46:12,136][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:46:12,949][__main__][INFO] - Iteration 569 took 23s (41.82% Gen, 54.74% Train). Generation: 9s, Training: 12s. Estimated remaining time: 19h 5m 0s. Estimated total time: 19h 44m 23s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 28s, 500 more iterations: 3h 17m 23s.
[2025-11-13 11:46:12,951][__main__][INFO] - Starting iteration 569.
[2025-11-13 11:46:12,954][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:46:12,955][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:46:22,766][__main__][INFO] - Number of regex retries in iteration 569: 0
[2025-11-13 11:46:22,766][__main__][INFO] - agents played in iteration 569 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:46:23,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:23,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:23,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:23,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:23,311][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:46:23,311][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:46:24,036][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:46:24,331][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:46:24,662][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:46:24,984][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:46:25,308][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:46:25,631][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:46:25,953][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:46:26,276][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:46:26,599][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:46:26,927][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:46:27,251][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:46:27,579][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:46:27,904][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:46:28,231][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:46:28,555][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:46:28,880][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:46:29,202][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:46:29,525][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:46:29,847][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:46:30,169][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:46:30,493][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:46:30,817][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:46:31,142][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:46:31,464][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:46:31,787][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:46:32,110][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:46:32,435][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:46:32,761][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:46:33,086][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:46:33,409][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:46:33,732][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:46:34,055][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:46:34,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:46:35,128][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:46:35,797][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:46:35,799][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:46:35,801][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:46:36,605][__main__][INFO] - Iteration 570 took 23s (41.48% Gen, 55.11% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 2m 49s. Estimated total time: 19h 42m 35s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 25s, 500 more iterations: 3h 17m 5s.
[2025-11-13 11:46:36,608][__main__][INFO] - Starting iteration 570.
[2025-11-13 11:46:36,610][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:46:36,611][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:46:46,442][__main__][INFO] - Number of regex retries in iteration 570: 0
[2025-11-13 11:46:46,443][__main__][INFO] - agents played in iteration 570 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:46:46,897][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:46,932][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:46,965][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:46,997][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:46,998][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:46:46,998][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:46:47,709][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:46:48,004][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:46:48,328][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:46:48,650][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:46:48,971][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:46:49,292][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:46:49,614][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:46:49,936][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:46:50,259][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:46:50,583][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:46:50,906][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:46:51,228][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:46:51,550][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:46:51,873][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:46:52,195][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:46:52,516][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:46:52,840][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:46:53,162][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:46:53,484][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:46:53,807][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:46:54,131][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:46:54,454][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:46:54,776][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:46:55,100][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:46:55,422][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:46:55,746][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:46:56,070][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:46:56,394][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:46:56,717][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:46:57,040][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:46:57,364][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:46:57,687][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:46:58,012][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:46:58,757][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:46:59,430][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:46:59,432][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:46:59,433][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:47:01,111][__main__][INFO] - Iteration 571 took 24s (40.13% Gen, 53.02% Train). Generation: 9s, Training: 12s. Estimated remaining time: 19h 44m 54s. Estimated total time: 20h 25m 6s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 50s, 500 more iterations: 3h 24m 11s.
[2025-11-13 11:47:01,113][__main__][INFO] - Starting iteration 571.
[2025-11-13 11:47:01,116][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:47:01,116][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:47:10,743][__main__][INFO] - Number of regex retries in iteration 571: 0 [2025-11-13 11:47:10,744][__main__][INFO] - agents played in iteration 571 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:47:11,180][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:47:11,216][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:47:11,249][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:47:11,282][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:47:11,282][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:47:11,282][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:47:12,007][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:47:12,302][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:47:12,625][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:47:12,948][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:47:13,274][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:47:13,597][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:47:13,919][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:47:14,242][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:47:14,565][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:47:14,889][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:47:15,211][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:47:15,533][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:47:15,856][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:47:16,178][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:47:16,501][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:47:16,824][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:47:17,147][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:47:17,471][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:47:17,794][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:47:18,118][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:47:18,442][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:47:18,767][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:47:19,093][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:47:19,418][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:47:19,742][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:47:20,066][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:47:20,389][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:47:20,716][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:47:21,039][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:47:21,365][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:47:21,689][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:47:22,013][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:47:22,343][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:47:23,067][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:47:23,778][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:47:23,780][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:47:23,781][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:47:24,574][__main__][INFO] - Iteration 572 took 23s (41.04% Gen, 55.57% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 52m 23s. Estimated total time: 19h 32m 58s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 5s, 500 more iterations: 3h 15m 29s.
[2025-11-13 11:47:24,576][__main__][INFO] - Starting iteration 572.
[2025-11-13 11:47:24,579][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:47:24,580][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:47:34,265][__main__][INFO] - Number of regex retries in iteration 572: 0
[2025-11-13 11:47:34,266][__main__][INFO] - agents played in iteration 572 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:47:34,684][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:34,721][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:34,754][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:34,787][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:34,788][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:47:34,789][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:47:35,497][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:47:35,904][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:47:36,242][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:47:36,565][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:47:36,894][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:47:37,218][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:47:37,544][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:47:37,867][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:47:38,192][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:47:38,517][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:47:38,839][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:47:39,171][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:47:39,489][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:47:39,814][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:47:40,142][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:47:40,470][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:47:40,793][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:47:41,116][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:47:41,440][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:47:41,772][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:47:42,087][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:47:42,415][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:47:42,738][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:47:43,066][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:47:43,385][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:47:43,709][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:47:44,032][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:47:44,362][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:47:44,679][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:47:45,004][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:47:45,327][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:47:45,650][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:47:45,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:47:46,682][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:47:47,385][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:47:47,387][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:47:47,388][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:47:48,171][__main__][INFO] - Iteration 573 took 23s (41.05% Gen, 55.62% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 58m 39s. Estimated total time: 19h 39m 37s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 19s, 500 more iterations: 3h 16m 36s.
[2025-11-13 11:47:48,172][__main__][INFO] - Starting iteration 573.
[2025-11-13 11:47:48,175][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:47:48,176][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:47:58,005][__main__][INFO] - Number of regex retries in iteration 573: 0
[2025-11-13 11:47:58,005][__main__][INFO] - agents played in iteration 573 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:47:58,434][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:58,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:58,503][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:58,537][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:58,538][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:47:58,538][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:47:59,241][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:47:59,536][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:47:59,861][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:48:00,183][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:48:00,507][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:48:00,830][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:48:01,154][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:48:01,483][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:48:01,799][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:48:02,120][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:48:02,444][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:48:02,773][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:48:03,092][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:48:03,415][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:48:03,738][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:48:04,069][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:48:04,388][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:48:04,713][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:48:05,038][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:48:05,369][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:48:05,690][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:48:06,012][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:48:06,338][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:48:06,662][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:48:06,986][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:48:07,309][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:48:07,634][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:48:07,957][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:48:08,282][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:48:08,606][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:48:08,930][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:48:09,253][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:48:09,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:48:10,303][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:48:11,005][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:48:11,007][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:48:11,008][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:48:11,981][__main__][INFO] - Iteration 574 took 23s (41.29% Gen, 54.62% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 8m 56s. Estimated total time: 19h 50m 18s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 40s, 500 more iterations: 3h 18m 23s.
[2025-11-13 11:48:11,983][__main__][INFO] - Starting iteration 574.
[2025-11-13 11:48:11,986][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:48:11,986][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:48:22,130][__main__][INFO] - Number of regex retries in iteration 574: 0
[2025-11-13 11:48:22,131][__main__][INFO] - agents played in iteration 574 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:48:22,573][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:22,607][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:22,641][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:22,675][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:22,676][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:48:22,676][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:48:23,378][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:48:23,674][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:48:23,999][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:48:24,322][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:48:24,645][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:48:24,972][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:48:25,295][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:48:25,618][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:48:25,942][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:48:26,266][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:48:26,595][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:48:26,919][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:48:27,245][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:48:27,571][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:48:27,895][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:48:28,221][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:48:28,544][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:48:28,868][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:48:29,195][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:48:29,523][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:48:29,846][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:48:30,169][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:48:30,497][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:48:30,822][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:48:31,145][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:48:31,469][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:48:31,792][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:48:32,115][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:48:32,438][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:48:32,760][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:48:33,084][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:48:33,407][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:48:33,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:48:34,451][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:48:35,128][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:48:35,129][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:48:35,131][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:48:35,963][__main__][INFO] - Iteration 575 took 23s (42.30% Gen, 54.22% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 17m 7s. Estimated total time: 19h 58m 53s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 57s, 500 more iterations: 3h 19m 48s.
[2025-11-13 11:48:35,964][__main__][INFO] - Starting iteration 575.
[2025-11-13 11:48:35,967][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:48:35,967][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:48:45,256][__main__][INFO] - Number of regex retries in iteration 575: 0
[2025-11-13 11:48:45,257][__main__][INFO] - agents played in iteration 575 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:48:45,684][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:45,719][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:45,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:45,787][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:45,788][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:48:45,788][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:48:46,574][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:48:46,869][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:48:47,193][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:48:47,516][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:48:47,839][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:48:48,168][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:48:48,492][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:48:48,814][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:48:49,137][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:48:49,460][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:48:49,786][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:48:50,110][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:48:50,435][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:48:50,759][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:48:51,083][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:48:51,407][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:48:51,730][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:48:52,053][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:48:52,376][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:48:52,700][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:48:53,023][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:48:53,347][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:48:53,671][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:48:54,001][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:48:54,319][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:48:54,643][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:48:54,967][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:48:55,298][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:48:55,613][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:48:55,936][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:48:56,260][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:48:56,591][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:48:56,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:48:57,628][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:48:58,325][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:48:58,327][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:48:58,329][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:48:59,134][__main__][INFO] - Iteration 576 took 23s (40.09% Gen, 56.43% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 36m 13s. Estimated total time: 19h 18m 22s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 36s, 500 more iterations: 3h 13m 3s.
[2025-11-13 11:48:59,136][__main__][INFO] - Starting iteration 576.
[2025-11-13 11:48:59,139][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:48:59,140][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:49:08,377][__main__][INFO] - Number of regex retries in iteration 576: 0
[2025-11-13 11:49:08,377][__main__][INFO] - agents played in iteration 576 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:49:08,809][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:08,842][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:08,876][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:08,910][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:08,911][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:49:08,911][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:49:09,640][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:49:09,938][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:49:10,261][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:49:10,588][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:49:10,913][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:49:11,238][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:49:11,569][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:49:11,893][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:49:12,218][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:49:12,542][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:49:12,878][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:49:13,202][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:49:13,526][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:49:13,850][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:49:14,178][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:49:14,501][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:49:14,825][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:49:15,149][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:49:15,485][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:49:15,810][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:49:16,133][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:49:16,458][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:49:16,792][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:49:17,115][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:49:17,441][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:49:17,769][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:49:18,096][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:49:18,419][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:49:18,743][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:49:19,068][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:49:19,399][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:49:19,717][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:49:20,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:49:20,797][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:49:21,476][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:49:21,478][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:49:21,481][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:49:22,319][__main__][INFO] - Iteration 577 took 23s (39.84% Gen, 56.53% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 36m 32s. Estimated total time: 19h 19m 5s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 38s, 500 more iterations: 3h 13m 10s. [2025-11-13 11:49:22,321][__main__][INFO] - Starting iteration 577. [2025-11-13 11:49:22,324][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1. 
[2025-11-13 11:49:22,324][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:49:31,809][__main__][INFO] - Number of regex retries in iteration 577: 0 [2025-11-13 11:49:31,809][__main__][INFO] - agents played in iteration 577 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:49:32,246][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:49:32,280][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:49:32,316][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:49:32,350][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:49:32,351][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:49:32,351][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:49:33,105][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:49:33,400][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:49:33,725][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:49:34,049][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:49:34,372][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:49:34,695][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:49:35,019][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:49:35,343][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:49:35,667][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:49:35,990][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:49:36,318][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:49:36,638][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:49:36,961][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:49:37,286][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:49:37,609][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:49:37,933][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:49:38,257][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:49:38,582][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:49:38,907][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:49:39,230][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:49:39,556][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:49:39,883][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:49:40,210][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:49:40,535][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:49:40,860][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:49:41,185][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:49:41,508][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:49:41,832][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:49:42,156][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:49:42,479][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:49:42,803][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:49:43,127][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:49:43,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:49:44,185][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:49:44,869][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:49:44,871][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:49:44,873][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:49:45,817][__main__][INFO] - Iteration 578 took 23s (40.37% Gen, 55.60% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 51m 45s. Estimated total time: 19h 34m 41s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 9s, 500 more iterations: 3h 15m 46s. [2025-11-13 11:49:45,819][__main__][INFO] - Starting iteration 578. [2025-11-13 11:49:45,822][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1. 
[2025-11-13 11:49:45,823][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:49:55,655][__main__][INFO] - Number of regex retries in iteration 578: 0 [2025-11-13 11:49:55,655][__main__][INFO] - agents played in iteration 578 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:49:56,117][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:49:56,155][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:49:56,189][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:49:56,236][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:49:56,237][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:49:56,237][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:49:56,957][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:49:57,253][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:49:57,578][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:49:57,903][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:49:58,227][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:49:58,554][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:49:58,877][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:49:59,200][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:49:59,523][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:49:59,846][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:50:00,170][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:50:00,503][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:50:00,826][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:50:01,149][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:50:01,472][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:50:01,803][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:50:02,128][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:50:02,456][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:50:02,783][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:50:03,111][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:50:03,436][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:50:03,760][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:50:04,084][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:50:04,410][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:50:04,733][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:50:05,057][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:50:05,383][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:50:05,709][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:50:06,032][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:50:06,356][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:50:06,680][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:50:07,008][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:50:07,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:50:08,058][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:50:08,758][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:50:08,759][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:50:08,761][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:50:09,630][__main__][INFO] - Iteration 579 took 23s (41.30% Gen, 55.05% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 7m 5s. Estimated total time: 19h 50m 25s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 40s, 500 more iterations: 3h 18m 24s. [2025-11-13 11:50:09,632][__main__][INFO] - Starting iteration 579. [2025-11-13 11:50:09,635][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1. 
[2025-11-13 11:50:09,635][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:50:19,013][__main__][INFO] - Number of regex retries in iteration 579: 0 [2025-11-13 11:50:19,014][__main__][INFO] - agents played in iteration 579 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:50:19,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:50:19,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:50:19,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:50:19,556][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:50:19,557][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:50:19,557][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:50:20,289][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:50:20,584][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:50:20,910][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:50:21,239][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:50:21,560][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:50:21,884][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:50:22,207][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:50:22,532][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:50:22,855][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:50:23,178][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:50:23,502][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:50:23,825][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:50:24,148][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:50:24,474][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:50:24,800][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:50:25,125][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:50:25,447][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:50:25,771][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:50:26,094][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:50:26,425][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:50:26,740][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:50:27,063][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:50:27,388][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:50:27,718][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:50:28,036][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:50:28,360][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:50:28,684][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:50:29,016][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:50:29,333][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:50:29,657][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:50:29,982][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:50:30,311][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:50:30,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:50:31,351][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:50:32,077][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:50:32,079][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:50:32,081][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:50:33,002][__main__][INFO] - Iteration 580 took 23s (40.13% Gen, 55.92% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 44m 39s. Estimated total time: 19h 28m 23s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 56s, 500 more iterations: 3h 14m 43s. [2025-11-13 11:50:33,004][__main__][INFO] - Starting iteration 580. [2025-11-13 11:50:33,007][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1. 
[2025-11-13 11:50:33,007][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:50:42,439][__main__][INFO] - Number of regex retries in iteration 580: 0 [2025-11-13 11:50:42,439][__main__][INFO] - agents played in iteration 580 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:50:42,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:50:42,910][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:50:42,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:50:42,977][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:50:42,978][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:50:42,978][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:50:43,696][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:50:43,992][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:50:44,319][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:50:44,644][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:50:44,967][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:50:45,290][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:50:45,613][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:50:45,936][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:50:46,259][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:50:46,587][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:50:46,912][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:50:47,238][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:50:47,562][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:50:47,886][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:50:48,209][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:50:48,533][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:50:48,856][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:50:49,183][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:50:49,508][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:50:49,833][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:50:50,157][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:50:50,486][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:50:50,809][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:50:51,132][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:50:51,456][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:50:51,779][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:50:52,103][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:50:52,426][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:50:52,751][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:50:53,077][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:50:53,403][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:50:53,727][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:50:54,051][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:50:54,753][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:50:55,490][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:50:55,491][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:50:55,493][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:50:57,232][__main__][INFO] - Iteration 581 took 24s (38.93% Gen, 53.88% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 27m 11s. Estimated total time: 20h 11m 18s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 22s, 500 more iterations: 3h 21m 53s. [2025-11-13 11:50:57,234][__main__][INFO] - Starting iteration 581. [2025-11-13 11:50:57,237][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. 
[2025-11-13 11:50:57,238][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:51:05,808][__main__][INFO] - Number of regex retries in iteration 581: 0 [2025-11-13 11:51:05,808][__main__][INFO] - agents played in iteration 581 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:51:06,281][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:51:06,318][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:51:06,351][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:51:06,385][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:51:06,386][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:51:06,386][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:51:07,107][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:51:07,407][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:51:07,728][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:51:08,051][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:51:08,375][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:51:08,707][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:51:09,024][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:51:09,347][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:51:09,670][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:51:09,994][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:51:10,317][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:51:10,641][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:51:10,964][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:51:11,290][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:51:11,617][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:51:11,942][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:51:12,266][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:51:12,590][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:51:12,917][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:51:13,242][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:51:13,566][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:51:13,889][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:51:14,215][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:51:14,540][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:51:14,865][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:51:15,190][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:51:15,517][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:51:15,841][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:51:16,165][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:51:16,490][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:51:16,818][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:51:17,141][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:51:17,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:51:18,192][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:51:18,902][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:51:18,904][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:51:18,906][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:51:19,940][__main__][INFO] - Iteration 582 took 22s (37.75% Gen, 57.69% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 10m 41s. Estimated total time: 18h 55m 11s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 50s, 500 more iterations: 3h 9m 11s. [2025-11-13 11:51:19,942][__main__][INFO] - Starting iteration 582. [2025-11-13 11:51:19,946][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. 
[2025-11-13 11:51:19,946][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:51:28,665][__main__][INFO] - Number of regex retries in iteration 582: 0
[2025-11-13 11:51:28,666][__main__][INFO] - agents played in iteration 582 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:51:29,113][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:29,146][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:29,180][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:29,215][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:29,215][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:51:29,215][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:51:29,957][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:51:30,251][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:51:30,575][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:51:30,899][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:51:31,221][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:51:31,544][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:51:31,868][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:51:32,193][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:51:32,519][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:51:32,841][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:51:33,164][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:51:33,488][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:51:33,816][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:51:34,139][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:51:34,463][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:51:34,786][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:51:35,115][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:51:35,437][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:51:35,761][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:51:36,084][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:51:36,412][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:51:36,735][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:51:37,058][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:51:37,382][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:51:37,712][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:51:38,035][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:51:38,358][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:51:38,682][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:51:39,016][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:51:39,345][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:51:39,668][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:51:39,992][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:51:40,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:51:41,016][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:51:41,735][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:51:41,737][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:51:41,738][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:51:42,603][__main__][INFO] - Iteration 583 took 22s (38.48% Gen, 57.70% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 8m 0s. Estimated total time: 18h 52m 53s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 45s, 500 more iterations: 3h 8m 48s.
[2025-11-13 11:51:42,605][__main__][INFO] - Starting iteration 583.
[2025-11-13 11:51:42,607][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:51:42,608][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:51:51,735][__main__][INFO] - Number of regex retries in iteration 583: 0
[2025-11-13 11:51:51,736][__main__][INFO] - agents played in iteration 583 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:51:52,195][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:52,229][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:52,263][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:52,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:52,297][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:51:52,297][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:51:53,012][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:51:53,317][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:51:53,631][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:51:53,955][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:51:54,278][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:51:54,601][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:51:54,924][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:51:55,247][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:51:55,571][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:51:55,894][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:51:56,227][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:51:56,542][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:51:56,865][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:51:57,194][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:51:57,524][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:51:57,839][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:51:58,162][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:51:58,487][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:51:58,812][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:51:59,135][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:51:59,461][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:51:59,785][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:52:00,108][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:52:00,430][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:52:00,754][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:52:01,078][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:52:01,402][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:52:01,728][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:52:02,052][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:52:02,376][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:52:02,699][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:52:03,024][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:52:03,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:52:04,050][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:52:04,776][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:52:04,777][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:52:04,779][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:52:05,627][__main__][INFO] - Iteration 584 took 23s (39.65% Gen, 56.66% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 25m 43s. Estimated total time: 19h 10m 59s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 21s, 500 more iterations: 3h 11m 49s.
[2025-11-13 11:52:05,629][__main__][INFO] - Starting iteration 584.
[2025-11-13 11:52:05,632][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:52:05,633][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:52:13,964][__main__][INFO] - Number of regex retries in iteration 584: 0
[2025-11-13 11:52:13,965][__main__][INFO] - agents played in iteration 584 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:52:14,398][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:14,431][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:14,464][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:14,496][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:14,497][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:52:14,497][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:52:15,244][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:52:15,539][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:52:15,862][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:52:16,186][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:52:16,512][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:52:16,834][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:52:17,156][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:52:17,480][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:52:17,804][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:52:18,128][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:52:18,451][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:52:18,774][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:52:19,097][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:52:19,422][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:52:19,745][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:52:20,069][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:52:20,397][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:52:20,722][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:52:21,048][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:52:21,374][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:52:21,700][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:52:22,024][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:52:22,353][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:52:22,676][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:52:23,001][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:52:23,323][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:52:23,646][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:52:23,970][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:52:24,295][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:52:24,618][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:52:24,943][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:52:25,268][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:52:25,591][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:52:26,289][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:52:26,992][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:52:26,993][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:52:26,995][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:52:27,907][__main__][INFO] - Iteration 585 took 22s (37.40% Gen, 58.49% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 48m 9s. Estimated total time: 18h 33m 47s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 7s, 500 more iterations: 3h 5m 37s.
[2025-11-13 11:52:27,909][__main__][INFO] - Starting iteration 585.
[2025-11-13 11:52:27,912][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:52:27,913][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:52:36,940][__main__][INFO] - Number of regex retries in iteration 585: 0
[2025-11-13 11:52:36,940][__main__][INFO] - agents played in iteration 585 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:52:37,370][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:37,403][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:37,437][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:37,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:37,471][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:52:37,471][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:52:38,190][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:52:38,487][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:52:38,812][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:52:39,137][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:52:39,461][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:52:39,785][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:52:40,109][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:52:40,433][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:52:40,760][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:52:41,084][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:52:41,407][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:52:41,730][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:52:42,053][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:52:42,379][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:52:42,703][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:52:43,027][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:52:43,350][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:52:43,674][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:52:43,996][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:52:44,319][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:52:44,644][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:52:44,967][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:52:45,296][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:52:45,619][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:52:45,941][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:52:46,267][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:52:46,591][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:52:46,915][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:52:47,239][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:52:47,565][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:52:47,890][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:52:48,213][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:52:48,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:52:49,271][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:52:49,960][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:52:49,962][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:52:49,963][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:52:50,840][__main__][INFO] - Iteration 586 took 22s (39.37% Gen, 56.80% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 20m 24s. Estimated total time: 19h 6m 25s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 12s, 500 more iterations: 3h 11m 4s.
[2025-11-13 11:52:50,842][__main__][INFO] - Starting iteration 586.
[2025-11-13 11:52:50,845][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:52:50,846][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:53:00,297][__main__][INFO] - Number of regex retries in iteration 586: 0
[2025-11-13 11:53:00,298][__main__][INFO] - agents played in iteration 586 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:53:00,741][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:53:00,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:53:00,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:53:00,842][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:53:00,842][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:53:00,842][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:53:01,561][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:53:01,856][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:53:02,180][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:53:02,504][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:53:02,828][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:53:03,157][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:53:03,475][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:53:03,798][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:53:04,122][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:53:04,450][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:53:04,771][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:53:05,095][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:53:05,418][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:53:05,753][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:53:06,078][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:53:06,403][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:53:06,728][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:53:07,056][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:53:07,382][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:53:07,711][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:53:08,038][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:53:08,366][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:53:08,686][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:53:09,009][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:53:09,333][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:53:09,661][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:53:09,981][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:53:10,306][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:53:10,629][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:53:10,954][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:53:11,281][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:53:11,604][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:53:11,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:53:12,635][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:53:13,338][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:53:13,340][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:53:13,341][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:53:14,245][__main__][INFO] - Iteration 587 took 23s (40.39% Gen, 55.74% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 43m 36s. Estimated total time: 19h 30m 0s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 0s, 500 more iterations: 3h 15m 0s.
[2025-11-13 11:53:14,246][__main__][INFO] - Starting iteration 587.
[2025-11-13 11:53:14,250][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:53:14,250][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:53:23,188][__main__][INFO] - Number of regex retries in iteration 587: 0
[2025-11-13 11:53:23,189][__main__][INFO] - agents played in iteration 587 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:53:23,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:53:23,661][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:53:23,695][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:53:23,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:53:23,729][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:53:23,729][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:53:24,474][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:53:24,767][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:53:25,092][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:53:25,416][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:53:25,741][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:53:26,064][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:53:26,389][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:53:26,713][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:53:27,040][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:53:27,367][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:53:27,691][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:53:28,014][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:53:28,340][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:53:28,663][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:53:28,990][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:53:29,313][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:53:29,641][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:53:29,970][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:53:30,296][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:53:30,621][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:53:30,948][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:53:31,272][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:53:31,595][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:53:31,918][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:53:32,245][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:53:32,571][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:53:32,894][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:53:33,218][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:53:33,544][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:53:33,867][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:53:34,191][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:53:34,516][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:53:34,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:53:35,544][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:53:36,243][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:53:36,245][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:53:36,246][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:53:37,343][__main__][INFO] - Iteration 588 took 23s (38.70% Gen, 56.54% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 27m 55s. Estimated total time: 19h 14m 42s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 29s, 500 more iterations: 3h 12m 27s.
[2025-11-13 11:53:37,345][__main__][INFO] - Starting iteration 588.
[2025-11-13 11:53:37,348][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:53:37,348][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:53:46,759][__main__][INFO] - Number of regex retries in iteration 588: 0
[2025-11-13 11:53:46,760][__main__][INFO] - agents played in iteration 588 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:53:47,200][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:53:47,234][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:53:47,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:53:47,301][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:53:47,301][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:53:47,302][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:53:48,032][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:53:48,327][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:53:48,652][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:53:48,975][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:53:49,298][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:53:49,622][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:53:49,949][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:53:50,273][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:53:50,596][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:53:50,921][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:53:51,244][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:53:51,568][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:53:51,896][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:53:52,224][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:53:52,547][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:53:52,870][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:53:53,193][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:53:53,518][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:53:53,840][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:53:54,165][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:53:54,488][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:53:54,812][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:53:55,135][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:53:55,459][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:53:55,783][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:53:56,109][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:53:56,432][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:53:56,756][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:53:57,083][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:53:57,405][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:53:57,728][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:53:58,051][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:53:58,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:53:59,075][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:53:59,759][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:53:59,760][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:53:59,762][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:54:00,681][__main__][INFO] - Iteration 589 took 23s (40.34% Gen, 55.72% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 39m 29s. Estimated total time: 19h 26m 40s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 53s, 500 more iterations: 3h 14m 26s.
[2025-11-13 11:54:00,683][__main__][INFO] - Starting iteration 589.
[2025-11-13 11:54:00,686][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:54:00,687][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:54:09,496][__main__][INFO] - Number of regex retries in iteration 589: 0
[2025-11-13 11:54:09,497][__main__][INFO] - agents played in iteration 589 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:54:09,933][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:54:09,970][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:54:10,003][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:54:10,037][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:54:10,037][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:54:10,038][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:54:10,791][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:54:11,086][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:54:11,410][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:54:11,741][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:54:12,065][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:54:12,388][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:54:12,711][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:54:13,048][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:54:13,372][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:54:13,696][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:54:14,020][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:54:14,353][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:54:14,677][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:54:15,002][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:54:15,328][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:54:15,655][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:54:15,981][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:54:16,308][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:54:16,637][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:54:16,967][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:54:17,293][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:54:17,623][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:54:17,949][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:54:18,275][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:54:18,599][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:54:18,923][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:54:19,247][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:54:19,571][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:54:19,895][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:54:20,218][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:54:20,542][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:54:20,870][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:54:21,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:54:21,897][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:54:22,617][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:54:22,618][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:54:22,620][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:54:23,537][__main__][INFO] - Iteration 590 took 22s (38.55% Gen, 57.43% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 15m 0s. Estimated total time: 19h 2m 33s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 5s, 500 more iterations: 3h 10m 25s.
[2025-11-13 11:54:23,539][__main__][INFO] - Starting iteration 590.
[2025-11-13 11:54:23,542][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:54:23,542][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:54:32,450][__main__][INFO] - Number of regex retries in iteration 590: 0
[2025-11-13 11:54:32,451][__main__][INFO] - agents played in iteration 590 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:54:32,914][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:54:32,947][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:54:32,980][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:54:33,014][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:54:33,015][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:54:33,015][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:54:33,728][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:54:34,026][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:54:34,351][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:54:34,674][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:54:34,998][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:54:35,321][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:54:35,645][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:54:35,969][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:54:36,292][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:54:36,618][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:54:36,951][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:54:37,275][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:54:37,598][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:54:37,921][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:54:38,257][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:54:38,585][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:54:38,914][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:54:39,236][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:54:39,561][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:54:39,885][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:54:40,210][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:54:40,535][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:54:40,863][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:54:41,188][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:54:41,513][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:54:41,837][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:54:42,164][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:54:42,489][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:54:42,813][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:54:43,138][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:54:43,465][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:54:43,789][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:54:44,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:54:44,816][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:54:45,489][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:54:45,490][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:54:45,492][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:54:47,345][__main__][INFO] - Iteration 591 took 23s (37.42% Gen, 54.78% Train). Generation: 8s, Training: 13s. Estimated remaining time: 19h 2m 15s. Estimated total time: 19h 50m 12s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 40s, 500 more iterations: 3h 18m 22s.
[2025-11-13 11:54:47,347][__main__][INFO] - Starting iteration 591.
[2025-11-13 11:54:47,350][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:54:47,350][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:54:57,214][__main__][INFO] - Number of regex retries in iteration 591: 0
[2025-11-13 11:54:57,215][__main__][INFO] - agents played in iteration 591 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:54:57,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:54:57,700][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:54:57,733][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:54:57,767][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:54:57,767][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:54:57,768][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:54:58,484][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:54:58,786][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:54:59,105][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:54:59,428][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:54:59,751][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:55:00,081][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:55:00,399][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:55:00,724][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:55:01,047][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:55:01,370][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:55:01,694][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:55:02,020][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:55:02,346][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:55:02,680][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:55:02,998][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:55:03,324][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:55:03,648][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:55:03,979][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:55:04,299][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:55:04,623][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:55:04,947][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:55:05,273][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:55:05,597][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:55:05,921][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:55:06,244][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:55:06,569][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:55:06,893][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:55:07,219][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:55:07,542][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:55:07,866][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:55:08,191][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:55:08,516][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:55:08,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:55:09,524][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:55:10,247][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:55:10,249][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:55:10,250][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:55:11,202][__main__][INFO] - Iteration 592 took 23s (41.35% Gen, 54.65% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 4m 19s. Estimated total time: 19h 52m 40s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 45s, 500 more iterations: 3h 18m 46s.
[2025-11-13 11:55:11,205][__main__][INFO] - Starting iteration 592.
[2025-11-13 11:55:11,208][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:55:11,209][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:55:19,706][__main__][INFO] - Number of regex retries in iteration 592: 0
[2025-11-13 11:55:19,707][__main__][INFO] - agents played in iteration 592 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:55:20,151][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:20,185][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:20,218][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:20,252][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:20,252][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:55:20,253][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:55:20,970][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:55:21,266][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:55:21,590][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:55:21,912][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:55:22,236][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:55:22,562][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:55:22,886][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:55:23,210][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:55:23,536][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:55:23,864][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:55:24,186][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:55:24,512][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:55:24,835][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:55:25,163][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:55:25,488][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:55:25,812][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:55:26,134][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:55:26,463][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:55:26,793][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:55:27,122][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:55:27,448][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:55:27,774][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:55:28,098][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:55:28,422][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:55:28,746][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:55:29,071][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:55:29,395][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:55:29,720][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:55:30,045][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:55:30,369][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:55:30,693][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:55:31,016][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:55:31,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:55:32,057][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:55:32,741][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:55:32,743][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:55:32,744][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:55:33,665][__main__][INFO] - Iteration 593 took 22s (37.84% Gen, 58.05% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 54m 10s. Estimated total time: 18h 42m 53s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 25s, 500 more iterations: 3h 7m 8s.
[2025-11-13 11:55:33,667][__main__][INFO] - Starting iteration 593.
[2025-11-13 11:55:33,671][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:55:33,671][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:55:42,140][__main__][INFO] - Number of regex retries in iteration 593: 0
[2025-11-13 11:55:42,141][__main__][INFO] - agents played in iteration 593 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:55:42,596][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:42,630][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:42,663][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:42,697][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:42,698][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:55:42,698][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:55:43,419][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:55:43,714][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:55:44,036][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:55:44,360][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:55:44,684][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:55:45,009][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:55:45,331][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:55:45,656][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:55:45,980][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:55:46,305][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:55:46,629][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:55:46,952][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:55:47,275][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:55:47,598][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:55:47,921][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:55:48,245][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:55:48,569][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:55:48,894][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:55:49,216][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:55:49,539][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:55:49,864][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:55:50,189][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:55:50,517][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:55:50,841][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:55:51,165][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:55:51,496][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:55:51,825][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:55:52,148][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:55:52,472][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:55:52,800][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:55:53,124][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:55:53,448][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:55:53,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:55:54,490][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:55:55,170][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:55:55,171][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:55:55,172][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:55:56,097][__main__][INFO] - Iteration 594 took 22s (37.76% Gen, 58.10% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 52m 15s. Estimated total time: 18h 41m 22s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 22s, 500 more iterations: 3h 6m 53s.
[2025-11-13 11:55:56,099][__main__][INFO] - Starting iteration 594.
[2025-11-13 11:55:56,102][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:55:56,103][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:56:04,968][__main__][INFO] - Number of regex retries in iteration 594: 0
[2025-11-13 11:56:04,968][__main__][INFO] - agents played in iteration 594 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:56:05,407][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:05,440][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:05,474][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:05,508][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:05,509][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:56:05,510][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:56:06,223][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:56:06,519][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:56:06,843][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:56:07,166][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:56:07,489][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:56:07,815][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:56:08,139][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:56:08,465][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:56:08,793][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:56:09,116][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:56:09,439][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:56:09,768][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:56:10,093][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:56:10,419][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:56:10,745][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:56:11,071][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:56:11,398][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:56:11,721][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:56:12,046][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:56:12,369][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:56:12,699][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:56:13,016][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:56:13,341][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:56:13,666][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:56:13,990][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:56:14,314][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:56:14,637][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:56:14,961][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:56:15,286][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:56:15,609][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:56:15,932][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:56:16,255][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:56:16,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:56:17,295][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:56:18,013][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:56:18,015][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:56:18,016][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:56:18,878][__main__][INFO] - Iteration 595 took 22s (38.92% Gen, 57.29% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 9m 21s. Estimated total time: 18h 58m 50s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 57s, 500 more iterations: 3h 9m 48s.
[2025-11-13 11:56:18,880][__main__][INFO] - Starting iteration 595.
[2025-11-13 11:56:18,883][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:56:18,883][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:56:27,384][__main__][INFO] - Number of regex retries in iteration 595: 0
[2025-11-13 11:56:27,384][__main__][INFO] - agents played in iteration 595 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:56:27,828][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:27,861][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:27,894][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:27,928][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:27,929][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:56:27,929][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:56:28,645][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:56:28,940][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:56:29,264][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:56:29,587][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:56:29,911][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:56:30,235][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:56:30,560][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:56:30,884][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:56:31,208][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:56:31,534][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:56:31,857][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:56:32,180][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:56:32,503][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:56:32,828][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:56:33,151][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:56:33,474][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:56:33,800][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:56:34,132][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:56:34,451][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:56:34,776][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:56:35,099][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:56:35,424][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:56:35,746][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:56:36,073][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:56:36,395][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:56:36,723][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:56:37,042][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:56:37,366][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:56:37,692][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:56:38,022][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:56:38,341][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:56:38,665][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:56:38,989][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:56:39,699][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:56:40,373][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:56:40,375][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:56:40,376][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:56:41,230][__main__][INFO] - Iteration 596 took 22s (38.04% Gen, 58.13% Train). Generation: 8s, Training: 12s. Estimated remaining time: 17h 47m 32s. Estimated total time: 18h 37m 24s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 14s, 500 more iterations: 3h 6m 14s.
[2025-11-13 11:56:41,232][__main__][INFO] - Starting iteration 596.
[2025-11-13 11:56:41,236][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:56:41,236][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:56:50,029][__main__][INFO] - Number of regex retries in iteration 596: 0
[2025-11-13 11:56:50,029][__main__][INFO] - agents played in iteration 596 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:56:50,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:50,503][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:50,537][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:50,571][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:50,571][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:56:50,572][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:56:51,297][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:56:51,592][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:56:51,917][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:56:52,240][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:56:52,571][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:56:52,893][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:56:53,217][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:56:53,540][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:56:53,864][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:56:54,189][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:56:54,513][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:56:54,836][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:56:55,165][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:56:55,490][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:56:55,815][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:56:56,143][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:56:56,471][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:56:56,795][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:56:57,119][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:56:57,442][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:56:57,770][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:56:58,094][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:56:58,418][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:56:58,743][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:56:59,069][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:56:59,393][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:56:59,717][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:57:00,041][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:57:00,365][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:57:00,700][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:57:01,025][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:57:01,350][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:57:01,674][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:57:02,399][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:57:03,072][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:57:03,073][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:57:03,075][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:57:04,002][__main__][INFO] - Iteration 597 took 22s (38.62% Gen, 57.30% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 8m 7s. Estimated total time: 18h 58m 21s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 56s, 500 more iterations: 3h 9m 43s.
[2025-11-13 11:57:04,004][__main__][INFO] - Starting iteration 597.
[2025-11-13 11:57:04,007][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:57:04,008][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:57:12,708][__main__][INFO] - Number of regex retries in iteration 597: 0
[2025-11-13 11:57:12,709][__main__][INFO] - agents played in iteration 597 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:57:13,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:13,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:13,212][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:13,245][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:13,246][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:57:13,246][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:57:13,965][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:57:14,260][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:57:14,583][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:57:14,907][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:57:15,230][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:57:15,553][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:57:15,876][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:57:16,199][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:57:16,522][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:57:16,846][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:57:17,169][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:57:17,493][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:57:17,818][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:57:18,144][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:57:18,472][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:57:18,797][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:57:19,124][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:57:19,448][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:57:19,777][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:57:20,103][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:57:20,429][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:57:20,754][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:57:21,078][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:57:21,404][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:57:21,730][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:57:22,053][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:57:22,376][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:57:22,699][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:57:23,027][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:57:23,351][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:57:23,676][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:57:23,999][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:57:24,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:57:25,024][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:57:25,738][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:57:25,740][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:57:25,742][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:57:26,651][__main__][INFO] - Iteration 598 took 22s (38.42% Gen, 57.55% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 1m 38s. Estimated total time: 18h 52m 15s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 44s, 500 more iterations: 3h 8m 42s.
[2025-11-13 11:57:26,654][__main__][INFO] - Starting iteration 598.
[2025-11-13 11:57:26,657][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:57:26,658][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:57:35,336][__main__][INFO] - Number of regex retries in iteration 598: 0
[2025-11-13 11:57:35,337][__main__][INFO] - agents played in iteration 598 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:57:35,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:35,807][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:35,841][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:35,875][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:35,876][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:57:35,876][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:57:36,626][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:57:36,927][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:57:37,252][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:57:37,579][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:57:37,902][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:57:38,226][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:57:38,552][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:57:38,888][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:57:39,212][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:57:39,535][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:57:39,859][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:57:40,190][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:57:40,514][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:57:40,837][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:57:41,160][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:57:41,483][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:57:41,808][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:57:42,137][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:57:42,463][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:57:42,787][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:57:43,111][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:57:43,434][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:57:43,757][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:57:44,082][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:57:44,406][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:57:44,729][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:57:45,054][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:57:45,381][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:57:45,709][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:57:46,033][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:57:46,358][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:57:46,685][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:57:47,008][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:57:47,714][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:57:48,405][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:57:48,406][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:57:48,408][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:57:49,288][__main__][INFO] - Iteration 599 took 22s (38.35% Gen, 57.75% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 0m 37s. Estimated total time: 18h 51m 36s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 43s, 500 more iterations: 3h 8m 36s. [2025-11-13 11:57:49,291][__main__][INFO] - Starting iteration 599. [2025-11-13 11:57:49,294][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. 
[2025-11-13 11:57:49,294][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:57:57,630][__main__][INFO] - Number of regex retries in iteration 599: 0 [2025-11-13 11:57:57,630][__main__][INFO] - agents played in iteration 599 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:57:58,080][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:57:58,113][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:57:58,151][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:57:58,185][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:57:58,186][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:57:58,186][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:57:58,940][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:57:59,235][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:57:59,561][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:57:59,888][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:58:00,211][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:58:00,534][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:58:00,857][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:58:01,180][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:58:01,504][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:58:01,828][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:58:02,151][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:58:02,477][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:58:02,801][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:58:03,123][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:58:03,448][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:58:03,773][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:58:04,097][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:58:04,425][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:58:04,749][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:58:05,074][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:58:05,400][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:58:05,723][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:58:06,051][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:58:06,379][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:58:06,705][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:58:07,029][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:58:07,353][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:58:07,676][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:58:08,000][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:58:08,327][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:58:08,652][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:58:08,976][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:58:09,300][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:58:10,027][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:58:10,731][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:58:10,733][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:58:10,734][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:58:11,585][__main__][INFO] - Iteration 600 took 22s (37.40% Gen, 58.78% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 43m 15s. Estimated total time: 18h 34m 36s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 9s, 500 more iterations: 3h 5m 46s. [2025-11-13 11:58:11,587][__main__][INFO] - Starting iteration 600. [2025-11-13 11:58:11,590][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. 
[2025-11-13 11:58:11,591][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:58:20,639][__main__][INFO] - Number of regex retries in iteration 600: 0 [2025-11-13 11:58:20,640][__main__][INFO] - agents played in iteration 600 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:58:21,080][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:58:21,114][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:58:21,147][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:58:21,181][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:58:21,182][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:58:21,182][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:58:21,926][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:58:22,222][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:58:22,546][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:58:22,869][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:58:23,192][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:58:23,516][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:58:23,838][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:58:24,160][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:58:24,483][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:58:24,807][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:58:25,130][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:58:25,452][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:58:25,775][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:58:26,099][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:58:26,423][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:58:26,747][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:58:27,071][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:58:27,394][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:58:27,724][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:58:28,048][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:58:28,372][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:58:28,695][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:58:29,024][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:58:29,349][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:58:29,674][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:58:29,999][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:58:30,324][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:58:30,650][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:58:30,977][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:58:31,301][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:58:31,624][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:58:31,952][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:58:32,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:58:32,998][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:58:33,692][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:58:33,693][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:58:33,695][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:58:35,486][__main__][INFO] - Iteration 601 took 23s (37.87% Gen, 54.63% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 3m 4s. Estimated total time: 19h 54m 49s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 49s, 500 more iterations: 3h 19m 8s. [2025-11-13 11:58:35,488][__main__][INFO] - Starting iteration 601. [2025-11-13 11:58:35,491][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. 
[2025-11-13 11:58:35,491][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:58:44,433][__main__][INFO] - Number of regex retries in iteration 601: 0 [2025-11-13 11:58:44,434][__main__][INFO] - agents played in iteration 601 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:58:44,886][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:58:44,920][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:58:44,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:58:44,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:58:44,994][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:58:44,994][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:58:45,738][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:58:46,033][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:58:46,361][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:58:46,680][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:58:47,004][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:58:47,327][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:58:47,656][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:58:47,973][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:58:48,296][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:58:48,619][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:58:48,942][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:58:49,265][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:58:49,589][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:58:49,914][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:58:50,238][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:58:50,562][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:58:50,886][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:58:51,210][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:58:51,533][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:58:51,857][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:58:52,186][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:58:52,511][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:58:52,836][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:58:53,166][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:58:53,492][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:58:53,816][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:58:54,139][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:58:54,464][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:58:54,790][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:58:55,113][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:58:55,436][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:58:55,763][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:58:56,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:58:56,805][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:58:57,507][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:58:57,508][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:58:57,510][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:58:58,627][__main__][INFO] - Iteration 602 took 23s (38.65% Gen, 56.52% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 24m 41s. Estimated total time: 19h 16m 50s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 33s, 500 more iterations: 3h 12m 48s. [2025-11-13 11:58:58,629][__main__][INFO] - Starting iteration 602. [2025-11-13 11:58:58,632][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. 
[2025-11-13 11:58:58,633][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:59:07,944][__main__][INFO] - Number of regex retries in iteration 602: 0 [2025-11-13 11:59:07,945][__main__][INFO] - agents played in iteration 602 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:59:08,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:59:08,413][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:59:08,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:59:08,483][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:59:08,484][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:59:08,484][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:59:09,548][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:59:09,852][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:59:10,167][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:59:10,491][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:59:10,816][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:59:11,147][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:59:11,463][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:59:11,787][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:59:12,111][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:59:12,444][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:59:12,762][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:59:13,086][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:59:13,410][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:59:13,745][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:59:14,059][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:59:14,383][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:59:14,706][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:59:15,038][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:59:15,358][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:59:15,684][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:59:16,020][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:59:16,354][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:59:16,669][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:59:16,993][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:59:17,317][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:59:17,650][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:59:17,965][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:59:18,289][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:59:18,612][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:59:18,948][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:59:19,264][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:59:19,588][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:59:19,912][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:59:20,620][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:59:21,323][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:59:21,324][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:59:21,326][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:59:22,492][__main__][INFO] - Iteration 603 took 23s (39.03% Gen, 56.08% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 0m 28s. Estimated total time: 19h 53m 1s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 46s, 500 more iterations: 3h 18m 50s. [2025-11-13 11:59:22,494][__main__][INFO] - Starting iteration 603. [2025-11-13 11:59:22,497][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. 
[2025-11-13 11:59:22,497][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:59:31,265][__main__][INFO] - Number of regex retries in iteration 603: 0 [2025-11-13 11:59:31,266][__main__][INFO] - agents played in iteration 603 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 11:59:31,711][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:59:31,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:59:31,780][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:59:31,814][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:59:31,815][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:59:31,815][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:59:32,563][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:59:32,859][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:59:33,183][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:59:33,510][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:59:33,833][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:59:34,156][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:59:34,482][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:59:34,807][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:59:35,129][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:59:35,453][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:59:35,776][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:59:36,103][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:59:36,426][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:59:36,751][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:59:37,077][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:59:37,399][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:59:37,721][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:59:38,045][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:59:38,368][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:59:38,693][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:59:39,015][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:59:39,339][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:59:39,662][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:59:39,986][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:59:40,310][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:59:40,633][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:59:40,958][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:59:41,282][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:59:41,605][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:59:41,930][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:59:42,254][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:59:42,579][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:59:42,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:59:43,597][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:59:44,308][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:59:44,310][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:59:44,311][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:59:45,258][__main__][INFO] - Iteration 604 took 22s (38.52% Gen, 57.31% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 5m 11s. Estimated total time: 18h 58m 7s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 56s, 500 more iterations: 3h 9m 41s. [2025-11-13 11:59:45,260][__main__][INFO] - Starting iteration 604. [2025-11-13 11:59:45,263][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. 
[2025-11-13 11:59:45,264][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:59:53,733][__main__][INFO] - Number of regex retries in iteration 604: 0
[2025-11-13 11:59:53,734][__main__][INFO] - agents played in iteration 604 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 11:59:54,177][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:54,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:54,247][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:54,281][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:54,281][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:59:54,282][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:59:55,001][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:59:55,296][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:59:55,621][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:59:55,945][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:59:56,268][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:59:56,591][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:59:56,914][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:59:57,240][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:59:57,564][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:59:57,887][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:59:58,210][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:59:58,533][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:59:58,857][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:59:59,180][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:59:59,505][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:59:59,829][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:00:00,152][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:00:00,476][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:00:00,799][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:00:01,126][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:00:01,449][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:00:01,775][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:00:02,102][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:00:02,426][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:00:02,752][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:00:03,076][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:00:03,403][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:00:03,727][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:00:04,050][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:00:04,376][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:00:04,698][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:00:05,022][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:00:05,351][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:00:06,083][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:00:06,758][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:00:06,760][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:00:06,761][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:00:07,792][__main__][INFO] - Iteration 605 took 22s (37.59% Gen, 57.82% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 53m 12s. Estimated total time: 18h 46m 30s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 33s, 500 more iterations: 3h 7m 45s.
[2025-11-13 12:00:07,795][__main__][INFO] - Starting iteration 605.
[2025-11-13 12:00:07,798][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 12:00:07,799][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:00:16,692][__main__][INFO] - Number of regex retries in iteration 605: 0
[2025-11-13 12:00:16,692][__main__][INFO] - agents played in iteration 605 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 12:00:17,130][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:17,167][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:17,201][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:17,236][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:17,236][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:00:17,237][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:00:17,934][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:00:18,228][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:00:18,553][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:00:18,880][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:00:19,205][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:00:19,527][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:00:19,851][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:00:20,174][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:00:20,500][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:00:20,825][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:00:21,149][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:00:21,472][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:00:21,795][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:00:22,119][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:00:22,444][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:00:22,768][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:00:23,091][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:00:23,415][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:00:23,739][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:00:24,062][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:00:24,386][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:00:24,711][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:00:25,034][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:00:25,359][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:00:25,685][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:00:26,009][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:00:26,333][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:00:26,658][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:00:26,980][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:00:27,304][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:00:27,628][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:00:27,952][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:00:28,278][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:00:29,019][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:00:29,712][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:00:29,713][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:00:29,715][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:00:30,834][__main__][INFO] - Iteration 606 took 23s (38.60% Gen, 56.53% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 18m 8s. Estimated total time: 19h 11m 49s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 23s, 500 more iterations: 3h 11m 58s.
[2025-11-13 12:00:30,836][__main__][INFO] - Starting iteration 606.
[2025-11-13 12:00:30,839][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 12:00:30,839][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:00:39,294][__main__][INFO] - Number of regex retries in iteration 606: 0
[2025-11-13 12:00:39,294][__main__][INFO] - agents played in iteration 606 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 12:00:39,716][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:39,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:39,782][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:39,815][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:39,815][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:00:39,816][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:00:40,529][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:00:40,825][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:00:41,151][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:00:41,480][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:00:41,803][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:00:42,127][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:00:42,460][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:00:42,775][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:00:43,101][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:00:43,424][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:00:43,748][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:00:44,071][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:00:44,394][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:00:44,717][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:00:45,047][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:00:45,368][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:00:45,692][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:00:46,016][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:00:46,345][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:00:46,666][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:00:46,989][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:00:47,313][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:00:47,643][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:00:47,960][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:00:48,286][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:00:48,610][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:00:48,934][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:00:49,263][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:00:49,587][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:00:49,911][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:00:50,236][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:00:50,559][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:00:50,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:00:51,605][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:00:52,282][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:00:52,284][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:00:52,286][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:00:54,577][__main__][INFO] - Iteration 607 took 23s (35.62% Gen, 54.73% Train). Generation: 8s, Training: 12s. Estimated remaining time: 18h 52m 51s. Estimated total time: 19h 46m 56s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 33s, 500 more iterations: 3h 17m 49s.
[2025-11-13 12:00:54,579][__main__][INFO] - Starting iteration 607.
[2025-11-13 12:00:54,583][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 12:00:54,583][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:01:03,328][__main__][INFO] - Number of regex retries in iteration 607: 0
[2025-11-13 12:01:03,329][__main__][INFO] - agents played in iteration 607 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 12:01:03,756][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:01:03,791][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:01:03,824][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:01:03,857][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:01:03,857][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:01:03,858][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:01:04,554][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:01:04,858][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:01:05,175][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:01:05,498][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:01:05,821][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:01:06,144][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:01:06,469][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:01:06,795][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:01:07,120][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:01:07,448][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:01:07,766][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:01:08,089][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:01:08,413][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:01:08,735][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:01:09,059][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:01:09,383][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:01:09,706][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:01:10,031][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:01:10,356][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:01:10,680][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:01:11,005][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:01:11,329][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:01:11,655][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:01:11,979][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:01:12,302][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:01:12,625][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:01:12,948][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:01:13,272][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:01:13,596][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:01:13,920][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:01:14,248][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:01:14,567][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:01:14,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:01:15,606][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:01:16,280][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:01:16,281][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:01:16,283][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:01:17,937][__main__][INFO] - Iteration 608 took 23s (37.45% Gen, 55.47% Train). Generation: 8s, Training: 12s. Estimated remaining time: 18h 33m 18s. Estimated total time: 19h 27m 46s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 55s, 500 more iterations: 3h 14m 37s.
[2025-11-13 12:01:17,939][__main__][INFO] - Starting iteration 608.
[2025-11-13 12:01:17,942][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 12:01:17,943][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:01:27,140][__main__][INFO] - Number of regex retries in iteration 608: 0
[2025-11-13 12:01:27,141][__main__][INFO] - agents played in iteration 608 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 12:01:27,571][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:01:27,604][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:01:27,637][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:01:27,670][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:01:27,670][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:01:27,671][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:01:28,368][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:01:28,663][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:01:28,992][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:01:29,319][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:01:29,647][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:01:29,975][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:01:30,296][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:01:30,620][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:01:30,942][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:01:31,266][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:01:31,594][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:01:31,923][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:01:32,248][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:01:32,572][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:01:32,895][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:01:33,219][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:01:33,543][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:01:33,867][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:01:34,191][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:01:34,514][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:01:34,837][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:01:35,161][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:01:35,485][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:01:35,808][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:01:36,132][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:01:36,456][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:01:36,784][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:01:37,104][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:01:37,426][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:01:37,749][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:01:38,073][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:01:38,396][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:01:38,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:01:39,459][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:01:40,133][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:01:40,134][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:01:40,136][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:01:41,034][__main__][INFO] - Iteration 609 took 23s (39.83% Gen, 56.27% Train). Generation: 9s, Training: 12s. Estimated remaining time: 18h 19m 46s. Estimated total time: 19h 14m 37s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 29s, 500 more iterations: 3h 12m 26s.
[2025-11-13 12:01:41,036][__main__][INFO] - Starting iteration 609.
[2025-11-13 12:01:41,040][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 12:01:41,040][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:01:46,352][mllm.models.large_language_model_local][WARNING] - Response >A did not match regex: (|), retry 1/1
[2025-11-13 12:01:50,692][__main__][INFO] - Number of regex retries in iteration 609: 1
[2025-11-13 12:01:50,692][__main__][INFO] - agents played in iteration 609 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 12:01:51,121][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:01:51,157][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:01:51,191][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:01:51,225][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:01:51,226][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:01:51,226][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:01:52,246][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:01:52,541][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:01:52,866][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:01:53,190][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:01:53,515][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:01:53,840][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:01:54,169][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:01:54,495][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:01:54,820][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:01:55,145][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:01:55,471][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:01:55,795][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:01:56,118][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:01:56,441][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:01:56,771][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:01:57,094][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:01:57,420][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:01:57,743][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:01:58,071][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:01:58,394][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:01:58,717][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:01:59,041][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:01:59,368][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:01:59,692][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:02:00,015][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:02:00,339][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:02:00,663][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:02:00,987][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:02:01,310][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:02:01,633][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:02:01,967][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:02:02,297][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:02:02,620][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:02:03,339][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:02:04,012][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:02:04,014][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:02:04,015][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:02:05,684][__main__][INFO] - Iteration 610 took 24s (39.16% Gen, 54.06% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 37m 0s. Estimated total time: 20h 32m 16s. Time estimates for 10 more iterations: 4m 6s, 100 more iterations: 41m 4s, 500 more iterations: 3h 25m 22s.
[2025-11-13 12:02:05,686][__main__][INFO] - Starting iteration 610.
[2025-11-13 12:02:05,690][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 12:02:05,690][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:02:14,625][__main__][INFO] - Number of regex retries in iteration 610: 0
[2025-11-13 12:02:14,625][__main__][INFO] - agents played in iteration 610 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 12:02:15,075][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:15,108][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:15,141][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:15,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:15,175][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:02:15,175][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:02:15,854][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:02:16,149][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:02:16,474][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:02:16,797][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:02:17,121][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:02:17,443][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:02:17,766][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:02:18,090][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:02:18,416][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:02:18,741][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:02:19,064][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:02:19,388][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:02:19,711][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:02:20,036][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:02:20,372][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:02:20,697][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:02:21,021][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:02:21,344][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:02:21,675][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:02:21,999][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:02:22,322][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:02:22,645][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:02:22,977][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:02:23,301][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:02:23,625][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:02:23,949][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:02:24,273][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:02:24,597][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:02:24,920][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:02:25,243][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:02:25,567][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:02:25,890][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:02:26,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:02:26,961][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:02:27,641][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:02:27,642][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:02:27,643][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:02:29,353][__main__][INFO] - Iteration 611 took 23s (37.76% Gen, 55.01% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 47m 33s. Estimated total time: 19h 43m 12s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 26s, 500 more iterations: 3h 17m 12s.
[2025-11-13 12:02:29,355][__main__][INFO] - Starting iteration 611.
[2025-11-13 12:02:29,359][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1.
[2025-11-13 12:02:29,359][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:02:38,597][__main__][INFO] - Number of regex retries in iteration 611: 0
[2025-11-13 12:02:38,598][__main__][INFO] - agents played in iteration 611 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 12:02:39,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:39,060][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:39,093][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:39,126][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:39,127][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:02:39,127][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:02:39,798][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:02:40,093][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:02:40,419][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:02:40,744][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:02:41,067][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:02:41,395][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:02:41,723][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:02:42,046][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:02:42,372][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:02:42,697][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:02:43,023][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:02:43,346][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:02:43,669][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:02:43,995][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:02:44,320][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:02:44,645][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:02:44,968][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:02:45,299][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:02:45,618][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:02:45,941][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:02:46,264][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:02:46,590][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:02:46,915][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:02:47,238][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:02:47,562][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:02:47,886][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:02:48,210][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:02:48,534][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:02:48,857][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:02:49,191][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:02:49,508][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:02:49,832][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:02:50,156][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:02:50,899][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:02:51,573][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:02:51,574][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:02:51,576][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:02:52,713][__main__][INFO] - Iteration 612 took 23s (39.56% Gen, 55.57% Train). Generation: 9s, Training: 12s. Estimated remaining time: 18h 31m 41s. Estimated total time: 19h 27m 44s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 55s, 500 more iterations: 3h 14m 37s.
[2025-11-13 12:02:52,715][__main__][INFO] - Starting iteration 612.
[2025-11-13 12:02:52,718][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1.
[2025-11-13 12:02:52,718][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:03:02,016][__main__][INFO] - Number of regex retries in iteration 612: 0
[2025-11-13 12:03:02,017][__main__][INFO] - agents played in iteration 612 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 12:03:02,455][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:02,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:02,520][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:02,554][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:02,554][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:03:02,555][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:03:03,227][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:03:03,522][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:03:03,847][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:03:04,170][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:03:04,494][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:03:04,819][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:03:05,142][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:03:05,464][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:03:05,786][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:03:06,111][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:03:06,444][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:03:06,767][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:03:07,091][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:03:07,416][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:03:07,739][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:03:08,063][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:03:08,387][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:03:08,712][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:03:09,043][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:03:09,366][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:03:09,691][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:03:10,014][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:03:10,338][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:03:10,673][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:03:10,997][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:03:11,321][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:03:11,646][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:03:11,972][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:03:12,297][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:03:12,621][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:03:12,944][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:03:13,267][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:03:13,591][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:03:14,318][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:03:14,998][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:03:14,999][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:03:15,001][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:03:15,834][__main__][INFO] - Iteration 613 took 23s (40.22% Gen, 56.17% Train). Generation: 9s, Training: 12s. Estimated remaining time: 18h 19m 24s. Estimated total time: 19h 15m 50s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 31s, 500 more iterations: 3h 12m 38s.
[2025-11-13 12:03:15,836][__main__][INFO] - Starting iteration 613.
[2025-11-13 12:03:15,839][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1.
[2025-11-13 12:03:15,839][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:03:25,032][__main__][INFO] - Number of regex retries in iteration 613: 0
[2025-11-13 12:03:25,032][__main__][INFO] - agents played in iteration 613 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 12:03:25,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:25,510][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:25,543][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:25,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:25,576][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:03:25,577][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:03:26,262][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:03:26,556][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:03:26,880][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:03:27,207][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:03:27,530][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:03:27,858][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:03:28,184][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:03:28,515][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:03:28,834][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:03:29,162][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:03:29,487][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:03:29,812][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:03:30,137][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:03:30,461][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:03:30,785][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:03:31,111][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:03:31,434][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:03:31,757][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:03:32,082][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:03:32,413][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:03:32,730][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:03:33,053][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:03:33,377][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:03:33,706][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:03:34,026][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:03:34,349][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:03:34,672][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:03:35,003][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:03:35,319][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:03:35,642][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:03:35,965][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:03:36,298][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:03:36,613][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:03:37,326][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:03:38,017][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:03:38,018][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:03:38,020][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:03:38,866][__main__][INFO] - Iteration 614 took 23s (39.92% Gen, 56.40% Train). Generation: 9s, Training: 12s. Estimated remaining time: 18h 14m 36s. Estimated total time: 19h 11m 25s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 22s, 500 more iterations: 3h 11m 54s.
[2025-11-13 12:03:38,868][__main__][INFO] - Starting iteration 614.
[2025-11-13 12:03:38,871][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1.
[2025-11-13 12:03:38,872][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:03:47,932][__main__][INFO] - Number of regex retries in iteration 614: 0
[2025-11-13 12:03:47,933][__main__][INFO] - agents played in iteration 614 are Bob, Alice, Bob_buffer, Alice_buffer
[2025-11-13 12:03:48,361][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:48,395][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:48,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:48,460][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:48,461][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:03:48,461][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:03:49,130][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:03:49,428][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:03:49,748][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:03:50,076][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:03:50,402][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:03:50,728][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:03:51,052][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:03:51,376][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:03:51,698][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:03:52,023][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:03:52,348][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:03:52,673][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:03:52,996][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:03:53,319][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:03:53,642][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:03:53,966][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:03:54,291][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:03:54,616][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:03:54,941][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:03:55,265][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:03:55,589][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:03:55,916][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:03:56,238][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:03:56,562][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:03:56,885][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:03:57,214][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:03:57,535][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:03:57,860][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:03:58,184][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:03:58,507][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:03:58,831][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:03:59,155][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:03:59,479][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:04:00,227][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:04:00,902][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:04:00,903][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:04:00,904][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:04:01,819][__main__][INFO] - Iteration 615 took 22s (39.48% Gen, 56.53% Train). Generation: 9s, Training: 12s. Estimated remaining time: 18h 10m 13s. Estimated total time: 19h 7m 26s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 14s, 500 more iterations: 3h 11m 14s.
[2025-11-13 12:04:01,821][__main__][INFO] - Starting iteration 615.
[2025-11-13 12:04:01,824][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1.
[2025-11-13 12:04:01,825][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:04:10,380][__main__][INFO] - Number of regex retries in iteration 615: 0 [2025-11-13 12:04:10,380][__main__][INFO] - agents played in iteration 615 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 12:04:10,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:04:10,854][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:04:10,886][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:04:10,919][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:04:10,920][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:04:10,920][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 12:04:11,599][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 12:04:11,892][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 12:04:12,217][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 12:04:12,538][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 12:04:12,861][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 12:04:13,184][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 12:04:13,507][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 12:04:13,830][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 12:04:14,154][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 12:04:14,479][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 12:04:14,803][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 12:04:15,128][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 12:04:15,453][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 12:04:15,778][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 12:04:16,103][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 12:04:16,427][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 12:04:16,751][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 12:04:17,076][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 12:04:17,401][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 12:04:17,725][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 12:04:18,049][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 12:04:18,374][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 12:04:18,702][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 12:04:19,026][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 12:04:19,350][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 12:04:19,674][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 12:04:19,998][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 12:04:20,322][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 12:04:20,646][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 12:04:20,971][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 12:04:21,297][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 12:04:21,621][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 12:04:21,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 12:04:22,665][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:04:23,335][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:04:23,337][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:04:23,338][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:04:24,264][__main__][INFO] - Iteration 616 took 22s (38.13% Gen, 57.74% Train). Generation: 8s, Training: 12s. Estimated remaining time: 17h 44m 28s. Estimated total time: 18h 42m 2s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 24s, 500 more iterations: 3h 7m 0s. [2025-11-13 12:04:24,266][__main__][INFO] - Starting iteration 616. [2025-11-13 12:04:24,270][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1. 
[2025-11-13 12:04:24,270][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:04:33,408][__main__][INFO] - Number of regex retries in iteration 616: 0 [2025-11-13 12:04:33,409][__main__][INFO] - agents played in iteration 616 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 12:04:33,849][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:04:33,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:04:33,915][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:04:33,948][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:04:33,948][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:04:33,949][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 12:04:34,627][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 12:04:34,921][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 12:04:35,244][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 12:04:35,568][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 12:04:35,893][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 12:04:36,217][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 12:04:36,544][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 12:04:36,873][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 12:04:37,196][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 12:04:37,521][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 12:04:37,850][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 12:04:38,175][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 12:04:38,498][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 12:04:38,825][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 12:04:39,151][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 12:04:39,476][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 12:04:39,800][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 12:04:40,123][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 12:04:40,447][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 12:04:40,772][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 12:04:41,099][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 12:04:41,423][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 12:04:41,746][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 12:04:42,070][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 12:04:42,397][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 12:04:42,725][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 12:04:43,050][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 12:04:43,375][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 12:04:43,702][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 12:04:44,026][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 12:04:44,350][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 12:04:44,674][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 12:04:45,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 12:04:45,718][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:04:46,407][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:04:46,408][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:04:46,410][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:04:47,477][__main__][INFO] - Iteration 617 took 23s (39.37% Gen, 56.02% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 22m 28s. Estimated total time: 19h 20m 26s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 40s, 500 more iterations: 3h 13m 24s. [2025-11-13 12:04:47,479][__main__][INFO] - Starting iteration 617. [2025-11-13 12:04:47,482][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1. 
[2025-11-13 12:04:47,482][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:04:56,802][__main__][INFO] - Number of regex retries in iteration 617: 0 [2025-11-13 12:04:56,803][__main__][INFO] - agents played in iteration 617 are Bob, Alice, Bob_buffer, Alice_buffer [2025-11-13 12:04:57,234][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:04:57,270][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:04:57,303][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:04:57,336][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:04:57,336][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:04:57,336][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 12:04:58,027][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 12:04:58,321][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 12:04:58,644][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 12:04:58,967][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 12:04:59,293][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 12:04:59,615][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 12:04:59,938][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 12:05:00,264][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 12:05:00,587][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 12:05:00,910][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 12:05:01,234][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 12:05:01,557][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 12:05:01,882][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 12:05:02,206][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 12:05:02,530][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 12:05:02,853][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 12:05:03,181][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 12:05:03,504][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 12:05:03,828][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 12:05:04,152][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 12:05:04,476][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 12:05:04,801][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 12:05:05,124][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 12:05:05,448][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 12:05:05,772][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 12:05:06,096][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 12:05:06,419][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 12:05:06,742][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 12:05:07,066][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 12:05:07,390][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 12:05:07,716][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 12:05:08,041][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 12:05:08,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 12:05:09,085][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:05:09,771][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:05:09,772][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:05:09,774][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed1111_bs128/seed_1111/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:05:10,873][__main__][INFO] - Iteration 618 took 23s (39.84% Gen, 55.45% Train). Generation: 9s, Training: 12s. Estimated remaining time: 18h 31m 15s. Estimated total time: 19h 29m 36s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 59s, 500 more iterations: 3h 14m 56s. [2025-11-13 12:05:10,875][__main__][INFO] - Starting iteration 618. [2025-11-13 12:05:13,832][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,933][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,933][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,936][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,936][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,936][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,936][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,936][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,936][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,937][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,937][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,937][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,937][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,937][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,938][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,938][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,938][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,938][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,938][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,938][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,939][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,939][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,939][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,939][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,939][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,939][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,940][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,940][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,940][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,940][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,940][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,941][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,941][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,941][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,941][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,941][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,942][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,942][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,942][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,942][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,942][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,942][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,943][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,943][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,943][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,943][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,943][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,943][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,944][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,944][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,944][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,944][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,944][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,944][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,945][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,945][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,945][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,945][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,945][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,945][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,946][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,946][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,946][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,946][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,946][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,946][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,947][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,982][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,982][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,982][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,983][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,983][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,983][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,983][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,983][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,983][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,983][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,984][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,984][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,984][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,984][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,984][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,984][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,984][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,985][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,985][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,985][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,985][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,985][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,985][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,985][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,985][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,986][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,986][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,986][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,986][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,986][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,986][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,986][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,987][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,987][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,987][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,987][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,987][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,987][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,987][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,987][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,988][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,988][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,988][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,988][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,988][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,988][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,988][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,988][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,989][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,989][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,989][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,989][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,989][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,989][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,989][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,990][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,990][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,990][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,990][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,990][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,990][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,990][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,990][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,991][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,991][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,991][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,991][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,991][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,991][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,991][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,992][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,992][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,992][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,992][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,992][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,992][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,993][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,993][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,993][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,993][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,993][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,993][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,993][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,993][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,994][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,994][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,994][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,994][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,994][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,994][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,994][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,994][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,995][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,995][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,995][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,995][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,995][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,995][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,995][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,996][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,996][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,996][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,996][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,996][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,996][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,996][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,996][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,997][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,997][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,997][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,997][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,997][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,997][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,997][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,997][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,998][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,998][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,998][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,998][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,998][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,998][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,998][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,998][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,999][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,999][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,999][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,999][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,999][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,999][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,999][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,000][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,000][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,000][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,000][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,000][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,000][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,000][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,000][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,001][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,001][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,001][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,001][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,001][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,001][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,001][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,001][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,002][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,002][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,002][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,002][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,002][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,002][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,002][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,002][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,003][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,003][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,003][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,003][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,003][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,003][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,003][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,004][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,004][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,004][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,004][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,004][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,004][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,004][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,004][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,005][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,005][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,005][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,005][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,005][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,005][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,005][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,006][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,006][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,006][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,006][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,006][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,006][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,006][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,006][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,007][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,007][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,007][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,007][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,007][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,007][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,007][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,008][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,008][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,008][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,008][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,008][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,008][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,008][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,008][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,009][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,009][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,009][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,009][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,009][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,009][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,009][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,010][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,010][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,010][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,010][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,010][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,010][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,010][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,011][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,011][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,011][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,011][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,011][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,011][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,011][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,011][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,012][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,012][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,012][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,012][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,012][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,012][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,012][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,013][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,013][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,013][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,013][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,013][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,013][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,013][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,014][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,014][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,014][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,014][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,014][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,014][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,014][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,015][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,015][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,015][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,015][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,015][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,015][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,015][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,015][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:14,086][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,086][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,086][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,086][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,086][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,086][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,087][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,087][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,087][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,087][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,087][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,087][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,087][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,087][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,088][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,088][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,088][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,088][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,088][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,088][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,168][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,168][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,168][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,168][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,169][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,169][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,169][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,169][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,169][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,169][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,169][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,170][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,170][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,170][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,170][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,170][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,170][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,170][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,171][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,171][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,171][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,171][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,171][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,171][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,171][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,171][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,172][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,172][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,172][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,172][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,172][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,172][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,172][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,173][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,173][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,173][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,173][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,173][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,173][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,173][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,174][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,174][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,174][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,174][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,174][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,174][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,174][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,174][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,175][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,175][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,175][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,175][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,175][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,175][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,175][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,176][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,176][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,176][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,176][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,176][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,176][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,176][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,177][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,177][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,177][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,177][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,177][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,177][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,177][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,177][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,178][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,178][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,178][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,178][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,178][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,178][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,178][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,179][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,179][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,179][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,179][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,179][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,179][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,179][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,179][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,180][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,180][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,180][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,180][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,180][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,180][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,180][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,181][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,181][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,181][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,181][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,181][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,181][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,181][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,181][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,182][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,182][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,182][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,182][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,182][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,182][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,182][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,183][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,183][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,183][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,183][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,183][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,183][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,183][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,184][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,184][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,184][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,184][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,184][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,184][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,184][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,184][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,185][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,185][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,185][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,185][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,185][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,185][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,185][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,186][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,186][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,186][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,186][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,186][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,186][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,186][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,186][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,187][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,187][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,187][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,187][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,187][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,188][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,188][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,188][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,188][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,188][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,188][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,188][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,188][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,189][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,189][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,189][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,189][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,189][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,189][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,189][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,190][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,190][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,190][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,190][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,190][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,190][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,190][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,191][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,191][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,191][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,191][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,191][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,191][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,191][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,191][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,192][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,192][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,192][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,192][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,192][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,192][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,192][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,193][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,193][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,193][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,193][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,193][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,193][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,193][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,194][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,194][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,194][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,194][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,194][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,194][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,194][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,194][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,195][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,195][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,195][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,195][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,195][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,195][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,195][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,196][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,196][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,196][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,196][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,196][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,196][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,196][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,196][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,197][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,197][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,197][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,197][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,197][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,197][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,197][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,198][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,198][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,198][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,198][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,198][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,198][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,198][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,198][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,199][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,199][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,199][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,199][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,199][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,199][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,199][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,200][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,200][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,200][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,200][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,200][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,200][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,200][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,200][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,201][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,201][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,201][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,201][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,201][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,201][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,201][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,202][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,202][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,202][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,202][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,202][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,202][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,202][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,202][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,203][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,203][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,203][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,203][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,203][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,203][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,203][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,204][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,204][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,204][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,204][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,204][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,204][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,204][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,204][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,205][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,205][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,205][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,205][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,205][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,205][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,205][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,206][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,206][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,206][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,206][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,206][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,206][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,206][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,206][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,207][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,207][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,207][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,207][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,207][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,207][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,207][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,208][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,208][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,208][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,208][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,208][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,208][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,208][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,208][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,209][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,209][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,209][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,209][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,209][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,209][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,209][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:14,279][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,279][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,280][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,280][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,280][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,280][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,280][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,280][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,280][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,281][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,281][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,281][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,281][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,281][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,281][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,281][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,281][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,282][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,282][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,282][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,282][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,282][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,282][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,282][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,283][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,283][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,283][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,283][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,283][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,283][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,283][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,283][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,284][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,284][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,284][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,284][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,284][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,284][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,284][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,285][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,285][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,285][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,285][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,285][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,285][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,285][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,286][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,286][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,286][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,286][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,286][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,286][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,286][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,287][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,287][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,287][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,287][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,287][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,287][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,287][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,287][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,288][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,288][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,288][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,288][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,288][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,288][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,288][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,289][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,289][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,289][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,289][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,289][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,289][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,289][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,289][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,290][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,290][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,290][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,290][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,290][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,290][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,290][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,291][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,291][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,291][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,291][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,291][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,291][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,291][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,291][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,292][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,292][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,292][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,292][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,292][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,292][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,292][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,293][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,293][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,293][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,293][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,293][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,293][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,293][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,293][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,294][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,294][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,294][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,294][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,294][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,294][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,294][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,295][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,295][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,295][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,295][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,295][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,295][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,295][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,296][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,296][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,296][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,296][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,296][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,296][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,296][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,296][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,297][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,297][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,297][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,297][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,297][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,297][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,297][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,298][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,298][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,298][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,298][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,298][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,298][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,298][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,298][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,299][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,299][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,299][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,299][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,299][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,299][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,299][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,300][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,300][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,300][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,300][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,300][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,300][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,300][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,300][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,301][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,301][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,301][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,301][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,301][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,301][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,301][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,301][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,302][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,302][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,302][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,302][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,302][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,302][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,302][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,303][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,303][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,303][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,303][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,303][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,303][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,303][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,304][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,304][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,304][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,304][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,304][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,304][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,304][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,304][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,305][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,305][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,305][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,305][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,305][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,305][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,305][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,306][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,306][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,306][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,306][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,306][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,306][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,306][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,306][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,307][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,307][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,307][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,307][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,307][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,307][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,307][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,308][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,308][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,308][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,308][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,308][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,308][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,308][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,308][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,309][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,309][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,309][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,309][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,309][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,309][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,309][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,310][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,310][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,310][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,310][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,310][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,310][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,310][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,310][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,311][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,311][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,311][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,311][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,311][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,311][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,311][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,312][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,312][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,312][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,312][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,312][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,312][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,312][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,312][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,313][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,313][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,313][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,313][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,313][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,313][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,313][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,314][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,314][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,314][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,314][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,314][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,314][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,314][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,314][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,315][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,315][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,315][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,315][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,315][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,315][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,315][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,316][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,316][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,316][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,316][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,316][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,316][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,316][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,316][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,317][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,317][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,317][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,317][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,317][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,317][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,317][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,317][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,318][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,318][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,318][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,318][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,318][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,318][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,318][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,319][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,319][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,319][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,319][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,319][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,319][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,319][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,320][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,320][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,320][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,320][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,320][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,320][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,320][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,320][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,321][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,321][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,321][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,321][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,321][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,321][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,321][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,321][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,322][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,322][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,322][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,322][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,322][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,322][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,322][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,323][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,323][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,323][asyncio][WARNING] - socket.send() raised exception. [... previous message repeated many times through 12:05:14,554 ...]
[2025-11-13 12:05:14,554][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,554][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,555][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,555][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,555][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,555][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,555][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,555][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,555][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,556][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,556][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,556][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,556][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,556][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,556][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,556][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,556][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,557][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,557][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,557][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,557][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,557][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,557][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,557][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,558][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,558][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,558][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,558][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,558][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,558][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,558][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,558][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,559][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,559][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,559][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,559][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,559][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,559][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,559][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,560][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,560][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,560][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,560][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,560][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,560][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,560][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,560][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,561][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,561][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,561][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,561][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,561][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,561][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,561][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,562][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,562][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,562][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,562][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,562][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,562][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,562][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,562][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,563][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,563][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,563][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,563][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,563][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,563][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,563][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,564][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,564][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,564][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,564][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,564][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,564][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,564][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,565][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,565][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,565][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,565][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,565][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,565][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,565][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,565][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,566][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,566][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,566][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,566][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,566][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,566][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,566][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,566][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,567][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,567][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,567][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,567][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,567][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,567][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,567][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,568][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,568][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,568][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,568][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,568][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,568][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,568][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,568][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,569][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,569][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,569][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,569][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,569][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,569][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,569][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,570][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,570][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,570][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,570][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,570][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,570][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,570][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,570][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,571][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,571][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,571][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,571][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,571][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,571][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,571][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,572][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,572][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,572][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,572][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,572][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,572][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,572][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,572][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,573][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,573][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,573][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,573][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,573][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,573][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,573][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,573][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,574][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,574][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,574][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,574][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,574][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,574][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,574][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,575][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,575][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,575][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,575][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,575][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,575][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,575][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,575][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,576][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,576][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,576][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,576][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,576][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,576][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,576][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,577][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,577][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,577][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,577][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,577][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,577][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,577][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,577][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,578][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,578][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,578][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,578][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,578][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,578][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,578][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,579][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,579][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,579][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,579][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,579][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,579][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,580][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,580][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,580][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,580][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,580][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,580][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,580][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,580][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,581][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,581][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,581][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,581][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,581][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,581][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,581][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,584][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,588][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,591][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,594][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,598][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,598][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,598][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,668][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,668][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,668][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,668][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,668][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,668][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,669][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,669][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,669][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,669][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,669][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,669][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,669][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,670][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,670][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,670][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,670][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,670][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,670][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,670][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,670][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,671][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,671][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,671][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,671][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,671][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,671][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,671][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,672][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,672][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,672][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,672][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,672][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,672][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,672][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,672][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,673][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,673][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,673][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,673][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,673][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,673][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,673][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,674][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,674][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,674][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,674][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,674][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,674][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,674][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,674][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,675][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,675][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,675][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,675][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,675][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,675][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,675][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,676][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,676][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,676][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,676][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,676][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,676][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,676][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,676][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,677][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,677][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,677][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,677][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,677][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,677][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,677][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,677][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,678][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,678][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,678][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,678][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,678][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,678][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,678][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,679][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,679][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,679][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,679][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,679][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,679][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,679][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,679][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,680][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,680][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,680][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,680][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,680][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,680][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,680][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,681][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,681][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,681][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,681][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,681][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,681][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,681][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,681][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,682][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,682][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,682][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,682][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,682][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,682][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,682][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,683][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,683][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,683][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,683][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,683][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,683][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,683][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,683][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,684][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,684][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,684][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,684][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,684][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,684][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,684][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,685][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,685][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,685][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,685][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,685][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,685][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,685][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,685][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,686][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,686][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,686][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,686][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,686][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,686][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,686][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,686][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,687][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,687][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,687][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,687][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,687][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,687][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,687][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,688][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,688][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,688][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,688][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,688][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,688][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,688][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,688][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,689][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,689][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,689][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,689][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,689][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,689][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,689][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,690][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,690][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,690][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,690][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,690][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,690][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,690][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,690][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,691][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,691][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,691][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,691][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,691][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,691][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,691][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,692][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,692][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,692][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,692][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,692][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,692][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,692][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,692][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,693][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,693][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,693][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,693][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,693][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,693][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,693][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,693][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,694][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,694][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,694][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,694][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,694][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,694][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,694][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,695][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,695][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,695][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,695][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,695][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,695][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,695][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,695][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,696][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,696][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,696][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,696][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,696][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,696][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,696][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,697][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,697][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,697][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,697][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,697][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,697][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,697][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,697][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,698][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,698][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,698][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,698][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,698][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,698][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,698][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,699][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,699][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,699][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,699][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,699][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,699][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,699][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,699][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,700][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,700][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,700][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,700][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,700][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,700][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,700][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,700][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,701][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,701][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,701][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,701][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,701][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,701][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,701][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,702][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,702][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,702][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,702][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,702][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,702][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,702][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,702][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,703][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,703][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,703][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,703][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,703][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,703][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,703][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,704][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,704][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,704][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,704][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,704][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,704][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,704][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,704][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,705][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,705][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,705][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,705][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,705][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,705][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,705][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,706][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,706][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,706][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,706][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,706][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,706][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,706][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,706][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,707][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,707][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,707][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,707][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,707][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,707][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,707][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,707][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,708][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,708][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,708][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,708][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,708][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,708][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,708][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,709][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,709][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,709][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,709][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,709][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,709][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,709][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,709][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,710][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,710][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,710][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,710][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,710][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,710][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,710][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,711][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,711][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,711][asyncio][WARNING] - socket.send() raised exception. 
[… identical warning repeated continuously through 2025-11-13 12:05:14,780 …]
[2025-11-13 12:05:14,780][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,780][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,781][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,781][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,781][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,781][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,781][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,781][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,781][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,781][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,782][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,782][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,782][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,782][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,782][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,782][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,782][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,782][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,783][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,783][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,783][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,783][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,783][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,783][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,783][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,784][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,784][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,784][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,784][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,784][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,784][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,784][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,784][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,785][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,785][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,785][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,785][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,785][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,785][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,785][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,787][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,790][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,793][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,797][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,800][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,803][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,807][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,807][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,807][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,807][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,807][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,807][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,807][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,808][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,808][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,808][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,808][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,808][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,808][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,808][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,808][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,809][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,809][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,809][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,809][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,809][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,809][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,809][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,809][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,810][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,810][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,810][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,810][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,810][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,810][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,810][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,811][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,811][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,811][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,811][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,811][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,811][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,811][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,811][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,812][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,812][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,812][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,812][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,812][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,812][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,812][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,813][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,813][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,813][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,813][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,813][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,813][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,813][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,813][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,814][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,814][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,814][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,814][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,814][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,814][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,814][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,815][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,815][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,815][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,815][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,815][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,815][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,815][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,816][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,816][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,816][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,816][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,816][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,816][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,816][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,816][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,817][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,817][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,817][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,817][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,817][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,817][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,817][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,818][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,818][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,818][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,818][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,818][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,818][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,818][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,818][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,819][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,819][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,819][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,819][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,819][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,819][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,819][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,820][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,820][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,820][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,820][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,820][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,820][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,820][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,820][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,821][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,821][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,821][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,821][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,821][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,821][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,821][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,822][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,822][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,822][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,822][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,822][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,822][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,822][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,822][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,823][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,823][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,823][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,823][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,823][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,823][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,823][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,824][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,893][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,893][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,893][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,893][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,893][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,894][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,894][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,894][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,894][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,894][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,894][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,894][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,895][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,895][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,895][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,895][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,895][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,895][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,895][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,895][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,896][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,896][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,896][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,896][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,896][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,896][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,896][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,897][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,897][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,897][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,897][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,897][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,897][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,897][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,897][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,898][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,898][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,898][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,898][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,898][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,898][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,899][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,899][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,899][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,899][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,899][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,899][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,899][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,900][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,903][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,906][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,910][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,913][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,916][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,919][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,919][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,919][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,919][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,919][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,919][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,919][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,919][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,920][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,920][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,920][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,920][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,920][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,920][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,920][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,921][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,921][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,921][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,921][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,921][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,921][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,921][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,921][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,922][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,922][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,922][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,922][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,922][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,922][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,922][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,923][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,923][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,923][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,923][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,923][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,923][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,923][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,923][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,924][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,924][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,924][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,924][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,924][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,924][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,924][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,925][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,925][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,925][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,925][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,925][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,925][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,925][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,925][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,926][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,926][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,926][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,926][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,926][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,926][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,926][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,927][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,927][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,927][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,927][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,927][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,927][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,927][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,927][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,928][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,928][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,928][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,928][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,928][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,928][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,928][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,928][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,929][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,929][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,929][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,929][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,929][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,929][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,929][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,930][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,930][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,930][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,930][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,930][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,930][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,930][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,930][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,931][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,931][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,931][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,931][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,931][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,931][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,931][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,932][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,932][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,932][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,932][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,932][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,932][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,932][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,932][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,933][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,933][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,933][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,933][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,933][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,933][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,933][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,936][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,936][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,936][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,936][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:14,936][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:14,936][asyncio][WARNING] - socket.send() raised exception.